Understanding Embodied Reference with Touch-Line Transformer
Code for the ICLR 2023 paper *Understanding Embodied Reference with Touch-Line Transformer*.
Authors: Yang Li, Xiaoxue Chen, Hao Zhao, Jiangtao Gong, Guyue Zhou, Federico Rossano, Yixin Zhu
Project Structure
```
Project_NAME/
├── process_masks_and_images_for_MAT.ipynb
├── main_ref.py
├── pretrained/
│   ├── 20_query_model.pth
│   ├── best_etf.pth
│   ├── best_arm.pth
│   ├── best_np.pth
│   └── best_ip.pth
├── predictions/
│   ├── arm.csv
│   ├── eye-to-fingertip.csv
│   ├── inpaint.csv
│   └── no_pose.csv
└── yourefit/
    ├── images/
    ├── pickle/
    ├── paf/
    ├── saliency/
    ├── inpaint_Place_using_expanded_masks/
    ├── eye_to_fingertip/
    │   ├── eye_to_fingertip_annotations_train.csv
    │   ├── eye_to_fingertip_annotations_valid.csv
    │   ├── train_names.txt
    │   └── valid_names.txt.txt
    └── arm/
```
- pretrained/: a directory that contains checkpoints.
- pretrained/20_query_model.pth: a checkpoint we sliced (from 100 queries down to 20) from the checkpoint provided by the authors of MDETR (see the sketch after this list).
- yourefit/: a directory that contains the downloaded YouRefIt dataset. This directory will also contain the inpaintings produced by readers (refer to the "generate inpaintings" step below for how to produce them).
- yourefit/eye_to_fingertip/: a directory containing annotations for eyes and fingertips.
- yourefit/arm/: annotations for arms.
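For orientation, slicing the queries of an MDETR-style checkpoint is conceptually a one-line tensor slice. Below is a minimal sketch, assuming the downloaded checkpoint stores its weights under a "model" key and its learned object queries under a parameter whose name contains "query_embed" with shape (num_queries, hidden_dim); the input filename is a placeholder, and this is an illustration rather than the authors' exact script.

```python
# Sketch: reduce an MDETR-style checkpoint from 100 object queries to 20.
# Assumptions (verify against the real checkpoint): weights live under
# checkpoint["model"], and the learned queries are stored in a tensor whose
# key contains "query_embed" with shape (num_queries, hidden_dim).
import torch

checkpoint = torch.load("pretrained/mdetr_original.pth", map_location="cpu")  # placeholder filename
state_dict = checkpoint["model"]

num_kept = 20
for key, value in list(state_dict.items()):
    if "query_embed" in key and value.shape[0] == 100:
        state_dict[key] = value[:num_kept].clone()  # keep the first 20 queries

torch.save(checkpoint, "pretrained/20_query_model.pth")
```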
File that Pertains Most to the Scientific Claims
models/mdetr.py
Environment and Data
1. Install dependencies
```
conda create --name nvvc python=3.8
conda activate nvvc
pip install -r requirements.txt
conda install pytorch torchvision torchaudio cudatoolkit=11.3 -c pytorch
```
2. Download data
- Download YouRefIt images and annotations as yourefit.zip.
- Unzip yourefit.zip outside of this project to obtain a folder named "yourefit".
- Move or copy "images", "pickle", "paf", and "saliency" from that external "yourefit" folder into the existing "yourefit" folder inside this project.
3. Download checkpoints (pre-trained models)
Use the hyperlinks in the checkpoint column of the table below, and put the downloaded files into the "pretrained" directory under the project root (refer to the "Project Structure" section above).
| Model | Precision (IoU=0.25) | Precision (IoU=0.50) | Precision (IoU=0.75) | Checkpoint |
|---|---|---|---|---|
| eye + fingertip | 0.7002 | 0.6251 | 0.3821 | best_etf.pth |
| elbow joint + wrist | 0.6787 | 0.5971 | 0.3477 | best_arm.pth |
| no explicit pose | 0.6371 | 0.5651 | 0.3621 | best_np.pth |
| inpainting | 0.5787 | 0.5092 | 0.3141 | best_ip.pth |
| MDETR | - | - | - | 20_query_model.pth |
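For reference, the precision columns count a prediction as correct when the IoU between the predicted box and the ground-truth box reaches the stated threshold. A minimal sketch of that check, assuming boxes in (x1, y1, x2, y2) pixel format (the evaluation code in this repo is the authoritative implementation):

```python
# Sketch: IoU between two axis-aligned boxes in (x1, y1, x2, y2) format,
# and the thresholded correct/incorrect decision behind the precision numbers.
def box_iou(box_a, box_b):
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def is_correct(pred_box, gt_box, threshold=0.25):
    return box_iou(pred_box, gt_box) >= threshold
```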
4. (Optional) Generate inpaintings
We provide a Jupyter notebook to expand the human masks required for inpainting. Readers need to generate the human masks themselves using F-RCNN, because the YouRefIt dataset does not include human masks. The mask generation process is straightforward; readers can refer to the GitHub repo created by the authors of F-RCNN for how to generate human masks. We only provide notebooks to expand and resize masks. Download the notebook by clicking the hyperlink below.
process_masks_and_images_for_MAT.ipynb
After processing the masks with the notebook, readers may or may not need to flip the values in the output masks (e.g., change 255 to 0 and 0 to 255), depending on how the human masks were generated with F-RCNN. After that, feed the masks and images to the MAT model for inpainting.
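If the flip is needed, it is a one-line array operation. A minimal sketch using Pillow and NumPy, where "masks_in" and "masks_out" are placeholder directory names:

```python
# Sketch: invert binary masks (255 <-> 0) before feeding them to MAT.
# "masks_in" and "masks_out" are placeholder directories.
from pathlib import Path

import numpy as np
from PIL import Image

out_dir = Path("masks_out")
out_dir.mkdir(exist_ok=True)

for mask_path in Path("masks_in").glob("*.png"):
    mask = np.array(Image.open(mask_path).convert("L"))
    flipped = 255 - mask  # 255 becomes 0, 0 becomes 255
    Image.fromarray(flipped.astype(np.uint8)).save(out_dir / mask_path.name)
```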
After inpainting, readers may need to resize the inpaintings back to the sizes of the original images because the inputs and outputs of MAT are square. If readers reshaped the expanded masks to squares (instead of masking them) before feeding them into MAT, they need to reshape the MAT outputs back to the original sizes. In contrast, if readers chose to mask, they can process the MAT outputs by cropping them. We only provide the notebook below to reshape square outputs back to the sizes of the original images.
restore_inpaint_size.ipynb
Note that readers need to modify image_dir, inpaint_dir, and output_dir in the notebook above:
- image_dir is the path to the YouRefIt images; the shapes of the original images are read from here.
- inpaint_dir is the path to the MAT outputs.
- output_dir is the path where the notebook stores the inpainted images after reshaping them to the sizes of the original images.
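For orientation, the resize step performed by the notebook is roughly the following. This is a sketch only: it assumes the MAT outputs share filenames with the original images, and the inpaint_dir value is a placeholder.

```python
# Sketch: resize square MAT outputs back to the original image sizes.
# image_dir, inpaint_dir, and output_dir mirror the variables in the notebook.
from pathlib import Path

from PIL import Image

image_dir = Path("yourefit/images")      # original images, used for target sizes
inpaint_dir = Path("MAT_outputs")        # placeholder: directory of square MAT outputs
output_dir = Path("yourefit/inpaint_Place_using_expanded_masks")
output_dir.mkdir(parents=True, exist_ok=True)

for inpaint_path in inpaint_dir.glob("*"):
    original = Image.open(image_dir / inpaint_path.name)  # assumes matching filenames
    restored = Image.open(inpaint_path).resize(original.size, Image.BILINEAR)
    restored.save(output_dir / inpaint_path.name)
```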
Finally, after obtaining the inpaintings, change INPAINT_DIR in magic_numbers.py to the path of the inpainted images that were reshaped back to the sizes of the original images. Note that INPAINT_DIR is a relative path (relative to Project_NAME/yourefit; please refer to the "Project Structure" section).
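For example, if the restored inpaintings are placed in the inpaint_Place_using_expanded_masks directory shown in the project structure, the line in magic_numbers.py would look roughly like this (adjust to your own path):

```python
# In magic_numbers.py: path relative to Project_NAME/yourefit
INPAINT_DIR = 'inpaint_Place_using_expanded_masks'
```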
Evaluate
eye + fingertip
Use the unmodified magic_numbers.py and run:
```
python main_ref.py --num_workers=1 --dataset_config configs/yourefit.json --batch_size 1 --ema --text_encoder_lr 1e-4 --lr 5e-5 --output-dir 'output_dir/eval' --resume pretrained/best_etf.pth --eval
```
elbow joint + wrist
Before running, in the otherwise unmodified magic_numbers.py, set:
```
REPLACE_ARM_WITH_EYE_TO_FINGERTIP = False
```
Then run:
```
python main_ref.py --num_workers=1 --dataset_config configs/yourefit.json --batch_size 1 --ema --text_encoder_lr 1e-4 --lr 5e-5 --output-dir 'output_dir/eval' --resume pretrained/best_arm.pth --eval
```
no explicit pose
Before running, in the otherwise unmodified magic_numbers.py, set:
```
RESERVE_QUERIES_FOR_ARMS = False
NUM_RESERVED_QUERIES_FOR_ARMS = 0
```
Then run:
```
python main_ref.py --num_workers=1 --dataset_config configs/yourefit.json --batch_size 1 --ema --text_encoder_lr 1e-4 --lr 5e-5 --output-dir 'output_dir/eval' --resume pretrained/best_np.pth --eval --pose False
```
inpainting
(Requires generated inpaintings; see the optional "Generate inpaintings" step in the "Environment and Data" section above.)
Before running, in the otherwise unmodified magic_numbers.py, set:
```
RESERVE_QUERIES_FOR_ARMS = False
NUM_RESERVED_QUERIES_FOR_ARMS = 0
REPLACE_IMAGES_WITH_INPAINT = True
```
Then run:
```
python main_ref.py --num_workers=1 --dataset_config configs/yourefit.json --batch_size 1 --ema --text_encoder_lr 1e-4 --lr 5e-5 --output-dir 'output_dir/eval' --resume pretrained/best_ip.pth --eval --pose False
```
Train
eye + fingertip
Use the unmodified magic_numbers.py and run:
```
python -m torch.distributed.launch --nproc_per_node=8 --master_port 64331 --use_env main_ref.py --num_workers 8 --dataset_config configs/yourefit.json --batch_size 7 --ema --text_encoder_lr 1e-4 --lr 5e-5 --output-dir 'output_dir/debug_etf' --load pretrained/20_query_model.pth
```
elbow joint + wrist
Before running, in the otherwise unmodified magic_numbers.py, set:
```
REPLACE_ARM_WITH_EYE_TO_FINGERTIP = False
```
Then run:
```
python -m torch.distributed.launch --nproc_per_node=8 --master_port 64332 --use_env main_ref.py --num_workers 8 --dataset_config configs/yourefit.json --batch_size 7 --ema --text_encoder_lr 1e-4 --lr 5e-5 --output-dir 'output_dir/debug_arm' --load pretrained/20_query_model.pth
```
no explicit pose
Before running, in the otherwise unmodified magic_numbers.py, set:
```
RESERVE_QUERIES_FOR_ARMS = False
NUM_RESERVED_QUERIES_FOR_ARMS = 0
```
Then run:
```
python -m torch.distributed.launch --nproc_per_node=8 --master_port 64333 --use_env main_ref.py --num_workers 8 --dataset_config configs/yourefit.json --batch_size 7 --ema --text_encoder_lr 1e-4 --lr 5e-5 --output-dir 'output_dir/debug_np' --load pretrained/20_query_model.pth --pose False
```
inpainting
(Requires generated inpaintings; see the optional "Generate inpaintings" step in the "Environment and Data" section above.)
Before running, in the otherwise unmodified magic_numbers.py, set:
```
RESERVE_QUERIES_FOR_ARMS = False
NUM_RESERVED_QUERIES_FOR_ARMS = 0
REPLACE_IMAGES_WITH_INPAINT = True
```
Then run:
```
python -m torch.distributed.launch --nproc_per_node=8 --master_port 64334 --use_env main_ref.py --num_workers 8 --dataset_config configs/yourefit.json --batch_size 7 --ema --text_encoder_lr 1e-4 --lr 5e-5 --output-dir 'output_dir/debug_ip' --load pretrained/20_query_model.pth --pose False
```
Visualizations
We provide Jupyter notebooks to visualize the predictions stored in csv files. To obtain these csv files, set SAVE_EVALUATION_PREDICTIONS = True in magic_numbers.py and run any of the evaluation commands provided in the "Evaluate" section above.
- cleaned_visualize_predictions_eye_to_fingertip.ipynb
- cleaned_visualize_predictions_elbow_joint_to_wrist.ipynb
- cleaned_visualize_predictions_no-pose.ipynb
- cleaned_visualize_predictions_inpaint.ipynb
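If you prefer not to use the notebooks, the idea is simply to read a predictions csv and draw the predicted boxes on the corresponding images. A minimal sketch follows; the column names (an image name plus box corner coordinates) are assumptions, so check the notebooks or the csv headers for the actual schema.

```python
# Sketch: overlay predicted boxes from a predictions csv onto the images.
# Assumed columns: image_name, x1, y1, x2, y2 -- verify against the real csv.
import pandas as pd
from PIL import Image, ImageDraw

predictions = pd.read_csv("predictions/eye-to-fingertip.csv")

for _, row in predictions.iterrows():
    image = Image.open(f"yourefit/images/{row['image_name']}").convert("RGB")
    draw = ImageDraw.Draw(image)
    draw.rectangle([row["x1"], row["y1"], row["x2"], row["y2"]], outline="red", width=3)
    image.save(f"visualized_{row['image_name']}")
```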
Dataset
We annotated eyes, fingertips, elbows, and wrists. The annotations are under the yourefit folder of this repo. Eye and fingertip locations are stored in csv files; elbow and wrist locations are stored in a json file.
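A minimal loading sketch is below. The column names, the json filename, and its structure are assumptions made for illustration; inspect the files under yourefit/eye_to_fingertip and yourefit/arm for the actual schema.

```python
# Sketch: load the eye/fingertip csv and the arm (elbow/wrist) json annotations.
# Column names, the json filename, and its structure are assumed -- check the files.
import json

import pandas as pd

etf = pd.read_csv("yourefit/eye_to_fingertip/eye_to_fingertip_annotations_train.csv")
print(etf.head())  # expect per-image eye and fingertip coordinates

with open("yourefit/arm/arm_annotations.json") as f:  # placeholder filename
    arm_annotations = json.load(f)
print(len(arm_annotations))
```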