Understanding Embodied Reference with Touch-Line Transformer
Code for the ICLR 2023 paper *Understanding Embodied Reference with Touch-Line Transformer*.
Authors: Yang Li, Xiaoxue Chen, Hao Zhao, Jiangtao Gong, Guyue Zhou, Federico Rossano, Yixin Zhu
Project Structure
```
Project_NAME/
├── process_masks_and_images_for_MAT.ipynb
├── main_ref.py
├── pretrained/
│   ├── 20_query_model.pth
│   ├── best_etf.pth
│   ├── best_arm.pth
│   ├── best_np.pth
│   └── best_ip.pth
├── predictions/
│   ├── arm.csv
│   ├── eye-to-fingertip.csv
│   ├── inpaint.csv
│   └── no_pose.csv
└── yourefit/
    ├── images/
    ├── pickle/
    ├── paf/
    ├── saliency/
    ├── inpaint_Place_using_expanded_masks/
    ├── eye_to_fingertip/
    │   ├── eye_to_fingertip_annotations_train.csv
    │   ├── eye_to_fingertip_annotations_valid.csv
    │   ├── train_names.txt
    │   └── valid_names.txt.txt
    └── arm/
```
- pretrained/: a directory that contains checkpoints.
- pretrained/20_query_model.pth: a checkpoint we sliced (from 100 queries down to 20) from the checkpoint provided by the authors of MDETR (see the sketch after this list).
- yourefit/: a directory that contains the downloaded YouRefIt dataset. This directory will also contain the inpaintings produced by readers (refer to the "generate inpaintings" step below for how to produce them).
- yourefit/eye_to_fingertip/: a directory containing annotations for eyes and fingertips.
- yourefit/arm/: annotations for arms.
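For orientation, slicing the queries of an MDETR-style checkpoint is conceptually a one-line tensor slice. Below is a minimal sketch, assuming the downloaded checkpoint stores its weights under a "model" key and its learned object queries under a parameter whose name contains "query_embed" with shape (num_queries, hidden_dim); the input filename is a placeholder, and this is an illustration rather than the authors' exact script.

```python
# Sketch: reduce an MDETR-style checkpoint from 100 object queries to 20.
# Assumptions (verify against the real checkpoint): weights live under
# checkpoint["model"], and the learned queries are stored in a tensor whose
# key contains "query_embed" with shape (num_queries, hidden_dim).
import torch

checkpoint = torch.load("pretrained/mdetr_original.pth", map_location="cpu")  # placeholder filename
state_dict = checkpoint["model"]

num_kept = 20
for key, value in list(state_dict.items()):
    if "query_embed" in key and value.shape[0] == 100:
        state_dict[key] = value[:num_kept].clone()  # keep the first 20 queries

torch.save(checkpoint, "pretrained/20_query_model.pth")
```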
File that Pertains Most to the Scientific Claims
models/mdetr.py
Environment and Data
1. Install dependencies
```
conda create --name nvvc python=3.8
conda activate nvvc
pip install -r requirements.txt
conda install pytorch torchvision torchaudio cudatoolkit=11.3 -c pytorch
```
2. Download data
- Download YouRefIt images and annotations as yourefit.zip.
- Unzip yourefit.zip outside of this project to obtain a folder named "yourefit".
- Move or copy "images", "pickle", "paf", and "saliency" from that external "yourefit" folder into the existing "yourefit" folder inside this project.
3. Download checkpoints (pre-trained models)
Use the hyperlinks in the checkpoint column of the table below, and put the downloaded files into the "pretrained" directory under the project root (refer to the "Project Structure" section above).
| Model | Precision (IoU=0.25) | Precision (IoU=0.50) | Precision (IoU=0.75) | Checkpoint |
|---|---|---|---|---|
| eye + fingertip | 0.7002 | 0.6251 | 0.3821 | best_etf.pth |
| elbow joint + wrist | 0.6787 | 0.5971 | 0.3477 | best_arm.pth |
| no explicit pose | 0.6371 | 0.5651 | 0.3621 | best_np.pth |
| inpainting | 0.5787 | 0.5092 | 0.3141 | best_ip.pth |
| MDETR | - | - | - | 20_query_model.pth |
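For reference, the precision columns count a prediction as correct when the IoU between the predicted box and the ground-truth box reaches the stated threshold. A minimal sketch of that check, assuming boxes in (x1, y1, x2, y2) pixel format (the evaluation code in this repo is the authoritative implementation):

```python
# Sketch: IoU between two axis-aligned boxes in (x1, y1, x2, y2) format,
# and the thresholded correct/incorrect decision behind the precision numbers.
def box_iou(box_a, box_b):
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def is_correct(pred_box, gt_box, threshold=0.25):
    return box_iou(pred_box, gt_box) >= threshold
```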
4. (Optional) Generate inpaintings
We provide a Jupyter notebook to expand the human masks required for inpainting. Readers need to generate the human masks themselves using F-RCNN, because the YouRefIt dataset does not include human masks. The mask generation process is straightforward; readers can refer to the GitHub repo created by the authors of F-RCNN for how to generate human masks. We only provide notebooks to expand and resize masks. Download the notebook by clicking the hyperlink below.
process_masks_and_images_for_MAT.ipynb
After processing the masks with the notebook, readers may or may not need to flip the values in the output masks (e.g., change 255 to 0 and 0 to 255), depending on how the human masks were generated with F-RCNN. After that, feed the masks and images to the MAT model for inpainting.
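If the flip is needed, it is a one-line array operation. A minimal sketch using Pillow and NumPy, where "masks_in" and "masks_out" are placeholder directory names:

```python
# Sketch: invert binary masks (255 <-> 0) before feeding them to MAT.
# "masks_in" and "masks_out" are placeholder directories.
from pathlib import Path

import numpy as np
from PIL import Image

out_dir = Path("masks_out")
out_dir.mkdir(exist_ok=True)

for mask_path in Path("masks_in").glob("*.png"):
    mask = np.array(Image.open(mask_path).convert("L"))
    flipped = 255 - mask  # 255 becomes 0, 0 becomes 255
    Image.fromarray(flipped.astype(np.uint8)).save(out_dir / mask_path.name)
```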
After inpainting, readers may need to resize the inpaintings back to the sizes of the original images because the inputs and outputs of MAT are square. If readers reshaped the expanded masks to squares (instead of masking them) before feeding them into MAT, they need to reshape the MAT outputs back to the original sizes. In contrast, if readers chose to mask, they can process the MAT outputs by cropping them. We only provide the notebook below to reshape square outputs back to the sizes of the original images.
restore_inpaint_size.ipynb
Note that readers need to modify image_dir, inpaint_dir, and output_dir in the notebook above:
- image_dir is the path to the YouRefIt images; the shapes of the original images are read from here.
- inpaint_dir is the path to the MAT outputs.
- output_dir is the path where the notebook stores the inpainted images after reshaping them to the sizes of the original images.
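For orientation, the resize step performed by the notebook is roughly the following. This is a sketch only: it assumes the MAT outputs share filenames with the original images, and the inpaint_dir value is a placeholder.

```python
# Sketch: resize square MAT outputs back to the original image sizes.
# image_dir, inpaint_dir, and output_dir mirror the variables in the notebook.
from pathlib import Path

from PIL import Image

image_dir = Path("yourefit/images")      # original images, used for target sizes
inpaint_dir = Path("MAT_outputs")        # placeholder: directory of square MAT outputs
output_dir = Path("yourefit/inpaint_Place_using_expanded_masks")
output_dir.mkdir(parents=True, exist_ok=True)

for inpaint_path in inpaint_dir.glob("*"):
    original = Image.open(image_dir / inpaint_path.name)  # assumes matching filenames
    restored = Image.open(inpaint_path).resize(original.size, Image.BILINEAR)
    restored.save(output_dir / inpaint_path.name)
```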
Finally, after obtaining the inpaintings, change INPAINT_DIR in magic_numbers.py to the path of the inpainted images that were reshaped back to the sizes of the original images. Note that INPAINT_DIR is a relative path (relative to Project_NAME/yourefit; please refer to the "Project Structure" section).
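For example, if the restored inpaintings are placed in the inpaint_Place_using_expanded_masks directory shown in the project structure, the line in magic_numbers.py would look roughly like this (adjust to your own path):

```python
# In magic_numbers.py: path relative to Project_NAME/yourefit
INPAINT_DIR = 'inpaint_Place_using_expanded_masks'
```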
Evaluate
eye + fingertip
Use the unmodified magic_numbers.py and run:
```
python main_ref.py --num_workers=1 --dataset_config configs/yourefit.json --batch_size 1 --ema --text_encoder_lr 1e-4 --lr 5e-5 --output-dir 'output_dir/eval' --resume pretrained/best_etf.pth --eval
```
elbow joint + wrist
Before running, in the otherwise unmodified magic_numbers.py, set:
```
REPLACE_ARM_WITH_EYE_TO_FINGERTIP = False
```
Then run:
```
python main_ref.py --num_workers=1 --dataset_config configs/yourefit.json --batch_size 1 --ema --text_encoder_lr 1e-4 --lr 5e-5 --output-dir 'output_dir/eval' --resume pretrained/best_arm.pth --eval
```
no explicit pose
Before running, in the otherwise unmodified magic_numbers.py, set:
```
RESERVE_QUERIES_FOR_ARMS = False
NUM_RESERVED_QUERIES_FOR_ARMS = 0
```
Then run:
```
python main_ref.py --num_workers=1 --dataset_config configs/yourefit.json --batch_size 1 --ema --text_encoder_lr 1e-4 --lr 5e-5 --output-dir 'output_dir/eval' --resume pretrained/best_np.pth --eval --pose False
```
inpainting
(Requires generated inpaintings; see the optional "Generate inpaintings" step in the "Environment and Data" section above.)
Before running, in the otherwise unmodified magic_numbers.py, set:
```
RESERVE_QUERIES_FOR_ARMS = False
NUM_RESERVED_QUERIES_FOR_ARMS = 0
REPLACE_IMAGES_WITH_INPAINT = True
```
Then run:
```
python main_ref.py --num_workers=1 --dataset_config configs/yourefit.json --batch_size 1 --ema --text_encoder_lr 1e-4 --lr 5e-5 --output-dir 'output_dir/eval' --resume pretrained/best_ip.pth --eval --pose False
```
Train
eye + fingertip
Use the unmodified magic_numbers.py and run:
```
python -m torch.distributed.launch --nproc_per_node=8 --master_port 64331 --use_env main_ref.py --num_workers 8 --dataset_config configs/yourefit.json --batch_size 7 --ema --text_encoder_lr 1e-4 --lr 5e-5 --output-dir 'output_dir/debug_etf' --load pretrained/20_query_model.pth
```
elbow joint + wrist
Before running, in the otherwise unmodified magic_numbers.py, set:
```
REPLACE_ARM_WITH_EYE_TO_FINGERTIP = False
```
Then run:
```
python -m torch.distributed.launch --nproc_per_node=8 --master_port 64332 --use_env main_ref.py --num_workers 8 --dataset_config configs/yourefit.json --batch_size 7 --ema --text_encoder_lr 1e-4 --lr 5e-5 --output-dir 'output_dir/debug_arm' --load pretrained/20_query_model.pth
```
no explicit pose
Before running, in the otherwise unmodified magic_numbers.py, set:
```
RESERVE_QUERIES_FOR_ARMS = False
NUM_RESERVED_QUERIES_FOR_ARMS = 0
```
Then run:
```
python -m torch.distributed.launch --nproc_per_node=8 --master_port 64333 --use_env main_ref.py --num_workers 8 --dataset_config configs/yourefit.json --batch_size 7 --ema --text_encoder_lr 1e-4 --lr 5e-5 --output-dir 'output_dir/debug_np' --load pretrained/20_query_model.pth --pose False
```
inpainting
(Requires generated inpaintings; see the optional "Generate inpaintings" step in the "Environment and Data" section above.)
Before running, in the otherwise unmodified magic_numbers.py, set:
```
RESERVE_QUERIES_FOR_ARMS = False
NUM_RESERVED_QUERIES_FOR_ARMS = 0
REPLACE_IMAGES_WITH_INPAINT = True
```
Then run:
```
python -m torch.distributed.launch --nproc_per_node=8 --master_port 64334 --use_env main_ref.py --num_workers 8 --dataset_config configs/yourefit.json --batch_size 7 --ema --text_encoder_lr 1e-4 --lr 5e-5 --output-dir 'output_dir/debug_ip' --load pretrained/20_query_model.pth --pose False
```
Visualizations
We provide Jupyter notebooks to visualize the predictions stored in csv files. To obtain these csv files, set SAVE_EVALUATION_PREDICTIONS = True in magic_numbers.py and run any of the evaluation commands provided in the "Evaluate" section above.
- cleaned_visualize_predictions_eye_to_fingertip.ipynb
- cleaned_visualize_predictions_elbow_joint_to_wrist.ipynb
- cleaned_visualize_predictions_no-pose.ipynb
- cleaned_visualize_predictions_inpaint.ipynb
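If you prefer not to use the notebooks, the idea is simply to read a predictions csv and draw the predicted boxes on the corresponding images. A minimal sketch follows; the column names (an image name plus box corner coordinates) are assumptions, so check the notebooks or the csv headers for the actual schema.

```python
# Sketch: overlay predicted boxes from a predictions csv onto the images.
# Assumed columns: image_name, x1, y1, x2, y2 -- verify against the real csv.
import pandas as pd
from PIL import Image, ImageDraw

predictions = pd.read_csv("predictions/eye-to-fingertip.csv")

for _, row in predictions.iterrows():
    image = Image.open(f"yourefit/images/{row['image_name']}").convert("RGB")
    draw = ImageDraw.Draw(image)
    draw.rectangle([row["x1"], row["y1"], row["x2"], row["y2"]], outline="red", width=3)
    image.save(f"visualized_{row['image_name']}")
```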
Dataset
We annotated eyes, fingertips, elbows, and wrists. The annotations are under the yourefit folder of this repo. Eye and fingertip locations are stored in csv files; elbow and wrist locations are stored in a json file.
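A minimal loading sketch is below. The column names, the json filename, and its structure are assumptions made for illustration; inspect the files under yourefit/eye_to_fingertip and yourefit/arm for the actual schema.

```python
# Sketch: load the eye/fingertip csv and the arm (elbow/wrist) json annotations.
# Column names, the json filename, and its structure are assumed -- check the files.
import json

import pandas as pd

etf = pd.read_csv("yourefit/eye_to_fingertip/eye_to_fingertip_annotations_train.csv")
print(etf.head())  # expect per-image eye and fingertip coordinates

with open("yourefit/arm/arm_annotations.json") as f:  # placeholder filename
    arm_annotations = json.load(f)
print(len(arm_annotations))
```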