This repo contains the official implementation of the CVPR 2022 paper:

<div align="center"> <h1> <b> End-to-End Referring Video Object Segmentation<br> with Multimodal Transformers </b> </h1> <h4> <b> Adam Botach, Evgenii Zheltonozhskii, Dr. Chaim Baskin

Technion – Israel Institute of Technology </b>

</h4> </div>

https://github.com/mttr2021/MTTR/assets/94481888/d5d7d014-9c4e-4062-9c32-fa93061898b5

Updates

3/3/2022

We are excited to announce that our paper was accepted for publication at CVPR 2022! 🥳🥳🥳

The paper can be accessed here.

8/12/2021

In response to your requests, we are releasing interactive demonstrations of MTTR on Google Colab and Hugging Face Spaces! 🚀 🤗

We currently recommend the Colab demo, as it is GPU-accelerated (and therefore much faster) and offers more options. The Spaces demo has a nicer interface but currently runs on CPU, so it is much slower.

Enjoy! :)

How to Run the Code

First, clone this repo to your local machine using:

git clone https://github.com/mttr2021/MTTR.git

Environment Installation

The code was tested on a Conda environment installed on Ubuntu 18.04. Install Conda and then create an environment as follows:

conda create -n mttr python=3.9.7 pip -y

conda activate mttr

conda install pytorch==1.10.0 torchvision==0.11.1 -c pytorch -c conda-forge

Note that the command above does not pin a cudatoolkit version. You might have to add one to the command so that it matches your system's CUDA version.

pip install transformers==4.11.3

pip install -U 'git+https://github.com/cocodataset/cocoapi.git#subdirectory=PythonAPI'

pip install h5py wandb opencv-python protobuf av einops ruamel.yaml timm joblib

conda install -c conda-forge pandas matplotlib cython scipy cupy
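
After installation, a quick sanity check can confirm that the pinned packages resolved to the expected versions. The following is a minimal sketch and not part of the repository; the helper name `check_versions` is hypothetical, and it assumes the Conda `pytorch` package is visible to Python under the distribution name `torch`.

```python
from importlib import metadata

# Pinned versions copied from the install commands above.
PINNED = {
    "torch": "1.10.0",
    "torchvision": "0.11.1",
    "transformers": "4.11.3",
}


def check_versions(pinned):
    """Return {package: (expected, found)} for every missing or mismatched package."""
    problems = {}
    for name, expected in pinned.items():
        try:
            found = metadata.version(name)
        except metadata.PackageNotFoundError:
            found = None  # package is not installed in this environment
        # Some builds append a local tag such as "+cu113"; ignore it when comparing.
        if found is None or found.split("+")[0] != expected:
            problems[name] = (expected, found)
    return problems


if __name__ == "__main__":
    for name, (expected, found) in check_versions(PINNED).items():
        print(f"{name}: expected {expected}, found {found}")
```

An empty result means all three pinned packages match. Anything printed is worth resolving before running the evaluation or training commands.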

Dataset Requirements

A2D-Sentences

Follow the instructions here to download the dataset. Then, extract and organize the files inside your cloned repo directory as follows (note that only the necessary files are shown):

MTTR/
└── a2d_sentences/
    ├── Release/
    │   ├── videoset.csv  (videos metadata file)
    │   └── CLIPS320/
    │       └── *.mp4     (video files)
    └── text_annotations/
        ├── a2d_annotation.txt  (actual text annotations)
        ├── a2d_missed_videos.txt
        └── a2d_annotation_with_instances/
            └── */ (video folders)
                └── *.h5 (annotation files)
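
To confirm that the files ended up where the tree above expects them, a small check along these lines may help. It is a sketch only: the function name `missing_entries` is hypothetical, and the patterns are transcribed from the tree rather than taken from the repository's data-loading code.

```python
from pathlib import Path

# Glob patterns (relative to the cloned repo root) transcribed from the tree above.
A2D_REQUIRED = [
    "a2d_sentences/Release/videoset.csv",
    "a2d_sentences/Release/CLIPS320/*.mp4",
    "a2d_sentences/text_annotations/a2d_annotation.txt",
    "a2d_sentences/text_annotations/a2d_missed_videos.txt",
    "a2d_sentences/text_annotations/a2d_annotation_with_instances/*/*.h5",
]


def missing_entries(repo_root, patterns):
    """Return the patterns that match nothing under repo_root."""
    root = Path(repo_root)
    return [p for p in patterns if next(root.glob(p), None) is None]


if __name__ == "__main__":
    for pattern in missing_entries(".", A2D_REQUIRED):
        print("missing:", pattern)
```

Run it from the cloned repo directory. The same function can be reused for the other datasets by passing a different pattern list.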

JHMDB-Sentences

Follow the instructions here to download the dataset. Then, extract and organize the files inside your cloned repo directory as follows (note that only the necessary files are shown):

MTTR/
└── jhmdb_sentences/
    ├── Rename_Images/  (frame images)
    │   └── */ (action dirs)
    ├── puppet_mask/  (mask annotations)
    │   └── */ (action dirs)
    └── jhmdb_annotation.txt  (text annotations)

Refer-YouTube-VOS

Download the dataset from the competition's website here.

Note that you may be required to sign up to the competition in order to get access to the dataset. This registration process is free and short.

Then, extract and organize the files inside your cloned repo directory as follows (note that only the necessary files are shown):

MTTR/
└── refer_youtube_vos/
    ├── train/
    │   ├── JPEGImages/
    │   │   └── */ (video folders)
    │   │       └── *.jpg (frame image files)
    │   └── Annotations/
    │       └── */ (video folders)
    │           └── *.png (mask annotation files)
    ├── valid/
    │   └── JPEGImages/
    │       └── */ (video folders)
    │           └── *.jpg (frame image files)
    └── meta_expressions/
        ├── train/
        │   └── meta_expressions.json  (text annotations)
        └── valid/
            └── meta_expressions.json  (text annotations)
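
Because the training split keeps frames and masks in parallel folder trees, an incomplete extraction can leave a video folder on one side only. The sketch below compares the two trees. It assumes that every training video folder under `JPEGImages/` should have a same-named folder under `Annotations/`, and the function name is hypothetical.

```python
from pathlib import Path


def unpaired_train_videos(dataset_root):
    """Compare video folder names under train/JPEGImages and train/Annotations.

    Returns (frames_only, masks_only): sorted names that appear on one side only.
    """
    train = Path(dataset_root) / "train"
    frames = {p.name for p in (train / "JPEGImages").iterdir() if p.is_dir()}
    masks = {p.name for p in (train / "Annotations").iterdir() if p.is_dir()}
    return sorted(frames - masks), sorted(masks - frames)


if __name__ == "__main__":
    frames_only, masks_only = unpaired_train_videos("refer_youtube_vos")
    print("videos without masks:", frames_only)
    print("masks without videos:", masks_only)
```

Two empty lists indicate that the folder names line up. The check does not inspect the individual frame or mask files.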

Running Configuration

The following table lists the parameters which can be configured directly from the command line.

The rest of the running/model parameters for each dataset can be configured in configs/DATASET_NAME.yaml.

Note that in order to run the code the path of the relevant .yaml config file needs to be supplied using the -c parameter.

| Command | Description |
| --- | --- |
| `-c` | path to dataset configuration file |
| `-rm` | running mode (train/eval) |
| `-ws` | window size |
| `-bs` | training batch size per GPU |
| `-ebs` | eval batch size per GPU (if not provided, training batch size is used) |
| `-ng` | number of GPUs to run on |
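
To make the table concrete, here is a minimal `argparse` sketch that accepts the same flags and applies the `-ebs` fallback described above. It is an illustration rather than the repository's actual parser: the `dest` names, the types, and the choice to make only `-c` mandatory are assumptions.

```python
import argparse


def build_parser():
    # allow_abbrev=False so that only the exact flags from the table are accepted.
    parser = argparse.ArgumentParser(allow_abbrev=False)
    parser.add_argument("-c", dest="config_path", required=True,
                        help="path to dataset configuration file")
    parser.add_argument("-rm", dest="running_mode", choices=["train", "eval"],
                        help="running mode")
    parser.add_argument("-ws", dest="window_size", type=int, help="window size")
    parser.add_argument("-bs", dest="batch_size", type=int,
                        help="training batch size per GPU")
    parser.add_argument("-ebs", dest="eval_batch_size", type=int,
                        help="eval batch size per GPU")
    parser.add_argument("-ng", dest="num_gpus", type=int,
                        help="number of GPUs to run on")
    return parser


def parse_args(argv=None):
    args = build_parser().parse_args(argv)
    # Per the table: if -ebs is not provided, the training batch size is used.
    if args.eval_batch_size is None:
        args.eval_batch_size = args.batch_size
    return args
```

For example, `parse_args(["-c", "configs/a2d_sentences.yaml", "-rm", "eval", "-bs", "3"])` yields an `eval_batch_size` of 3 because `-ebs` was omitted.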

Evaluation

The following commands can be used to reproduce the main results of our paper using the supplied checkpoint files.

The commands were tested on RTX 3090 24GB GPUs, but it may be possible to run some of them using GPUs with less memory by decreasing the batch-size -bs parameter.

A2D-Sentences

| Window Size | Command | Checkpoint File | mAP Result |
| --- | --- | --- | --- |
| 10 | `python main.py -rm eval -c configs/a2d_sentences.yaml -ws 10 -bs 3 -ckpt CHECKPOINT_PATH -ng 2` | Link | 46.1 |
| 8 | `python main.py -rm eval -c configs/a2d_sentences.yaml -ws 8 -bs 3 -ckpt CHECKPOINT_PATH -ng 2` | Link | 44.7 |

JHMDB-Sentences

The following commands evaluate our A2D-Sentences-pretrained model on JHMDB-Sentences without additional training.

For this purpose, as explained in our paper, we uniformly sample three frames from each video. To ensure proper reproduction of our results on other machines we include the metadata of the sampled frames under datasets/jhmdb_sentences/jhmdb_sentences_samples_metadata.json. This file is automatically loaded during the evaluation process with the commands below.

To avoid using this file and force sampling different frames, change the seed and generate_new_samples_metadata parameters under MTTR/configs/jhmdb_sentences.yaml.

| Window Size | Command | Checkpoint File | mAP Result |
| --- | --- | --- | --- |
| 10 | `python main.py -rm eval -c configs/jhmdb_sentences.yaml -ws 10 -bs 3 -ckpt CHECKPOINT_PATH -ng 2` | Link | 39.2 |
| 8 | `python main.py -rm eval -c configs/jhmdb_sentences.yaml -ws 8 -bs 3 -ckpt CHECKPOINT_PATH -ng 2` | Link | 36.6 |
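
The role of the stored sample metadata can be illustrated with a short sketch. It assumes that "uniformly sample" means drawing three distinct frame indices uniformly at random from each video with a fixed seed. The function names, the `video_lengths` input, and the JSON layout are hypothetical, so the repository's actual sampling code and metadata format may differ.

```python
import json
import random


def sample_frames(video_lengths, seed, frames_per_video=3):
    """Draw distinct frame indices uniformly at random for every video.

    video_lengths maps a video id to its number of frames (hypothetical format).
    """
    rng = random.Random(seed)  # a fixed seed makes the draw repeatable
    samples = {}
    for video_id in sorted(video_lengths):  # sorted so the order of draws is stable
        n = video_lengths[video_id]
        samples[video_id] = sorted(rng.sample(range(n), min(frames_per_video, n)))
    return samples


def load_or_create_samples(path, video_lengths, seed, generate_new=False):
    """Reuse stored samples when present; otherwise draw new ones and store them."""
    if not generate_new:
        try:
            with open(path) as f:
                return json.load(f)
        except FileNotFoundError:
            pass
    samples = sample_frames(video_lengths, seed)
    with open(path, "w") as f:
        json.dump(samples, f, indent=2)
    return samples
```

Loading the stored indices takes the random number generator out of the picture entirely, so every machine evaluates on the same frames regardless of its library versions.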

Refer-YouTube-VOS

The following command evaluates our model on the public validation subset of the Refer-YouTube-VOS dataset. Since annotations are not publicly available for this subset, our code generates a zip file with the predicted masks under MTTR/runs/[RUN_DATE_TIME]/validation_outputs/submission_epoch_0.zip. This zip file needs to be uploaded to the competition server for evaluation. For your convenience, we supply this zip file here as well.

| Window Size | Command | Checkpoint File | Output Zip | J&F Result |
| --- | --- | --- | --- | --- |
| 12 | `python main.py -rm eval -c configs/refer_youtube_vos.yaml -ws 12 -bs 1 -ckpt CHECKPOINT_PATH -ng 8` | Link | Link | 55.32 |
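
Since each run writes its zip into a separate date-stamped folder, locating the newest one by hand can be tedious. The sketch below picks the most recently modified submission zip. The function name is hypothetical, and the wildcard over the epoch number is an assumption that generalizes the `submission_epoch_0.zip` name given above.

```python
from pathlib import Path


def latest_submission_zip(repo_root):
    """Return the most recently modified submission zip under runs/, or None."""
    pattern = "runs/*/validation_outputs/submission_epoch_*.zip"
    candidates = list(Path(repo_root).glob(pattern))
    return max(candidates, key=lambda p: p.stat().st_mtime, default=None)


if __name__ == "__main__":
    print(latest_submission_zip("."))
```

Run from the cloned repo directory, it prints the path of the zip to upload, or `None` if no evaluation run has produced one yet.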

Training

First, download the Kinetics-400 pretrained weights of Video Swin Transformer from this link. Note that these weights were originally published in the Video Swin Transformer repo here.

Place the downloaded file inside your cloned repo directory as MTTR/pretrained_swin_transformer/swin_tiny_patch244_window877_kinetics400_1k.pth.

Next, the following commands can be used to train MTTR as described in our paper.

Note that it may be possible to run some of these commands on GPUs with less memory than the ones mentioned below by decreasing the batch-size -bs or window-size -ws parameters. However, changing these parameters may also affect the final performance of the model.
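
One way to see how these parameters interact is to compute the overall batch size and the number of frames processed per training step. The sketch below uses the flag values of the A2D-Sentences training commands in this section. It assumes that each GPU processes its own batch of `-bs` samples and that each sample is a window of `-ws` frames. The class and property names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunConfig:
    window_size: int  # -ws
    batch_size: int   # -bs, per GPU
    num_gpus: int     # -ng

    @property
    def global_batch_size(self):
        # Each GPU processes its own batch, so the totals multiply.
        return self.batch_size * self.num_gpus

    @property
    def frames_per_step(self):
        return self.window_size * self.global_batch_size


a2d_ws10 = RunConfig(window_size=10, batch_size=3, num_gpus=2)
a2d_ws8 = RunConfig(window_size=8, batch_size=2, num_gpus=3)
print(a2d_ws10.global_batch_size, a2d_ws10.frames_per_step)  # 6 60
print(a2d_ws8.global_batch_size, a2d_ws8.frames_per_step)    # 6 48
```

Under these assumptions both A2D-Sentences configurations keep the overall batch at 6 samples. Lowering `-bs` without raising `-ng` shrinks it, which is one way the final performance can be affected.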

A2D-Sentences

| Window Size | Command |
| --- | --- |
| 10 | `python main.py -rm train -c configs/a2d_sentences.yaml -ws 10 -bs 3 -ng 2` |
| 8 | `python main.py -rm train -c configs/a2d_sentences.yaml -ws 8 -bs 2 -ng 3` |

Refer-YouTube-VOS

| Window Size | Command |
| --- | --- |
| 12 | `python main.py -rm train -c configs/refer_youtube_vos.yaml -ws 12 -bs 1 -ng 4` |
| 8 | `python main.py -rm train -c configs/refer_youtube_vos.yaml -ws 8 -bs 1 -ng 8` |

Note that this last configuration was not mentioned in our paper. However, it is more memory efficient than the original configuration (window size 12) while producing a model which is almost as good (J&F of 54.56 in our experiments).

JHMDB-Sentences

As explained in our paper, JHMDB-Sentences is used exclusively for evaluation, so training on this dataset is not supported at this time.

Citation

Please consider citing our work in your publications if it helped you or if it is relevant to your research:

@inproceedings{botach2021end,
  title={End-to-End Referring Video Object Segmentation with Multimodal Transformers},
  author={Botach, Adam and Zheltonozhskii, Evgenii and Baskin, Chaim},
  booktitle={Proc. IEEE Conf. Computer Vision and Pattern Recognition (CVPR)},
  year={2022}
}