RMem: Restricted Memory Banks Improve Video Object Segmentation
Junbao Zhou, Ziqi Pang, Yu-Xiong Wang
University of Illinois Urbana-Champaign
Abstract
With recent video object segmentation (VOS) benchmarks evolving to challenging scenarios, we revisit a simple but overlooked strategy: restricting the size of memory banks. This diverges from the prevalent practice of expanding memory banks to accommodate extensive historical information. Our specially designed "memory deciphering" study offers a pivotal insight underpinning such a strategy: expanding memory banks, while seemingly beneficial, actually increases the difficulty for VOS modules to decode relevant features due to the confusion from redundant information. By restricting memory banks to a limited number of essential frames, we achieve a notable improvement in VOS accuracy. This process balances the importance and freshness of frames to maintain an informative memory bank within a bounded capacity. Additionally, restricted memory banks reduce the training-inference discrepancy in memory lengths compared with continuous expansion. This fosters new opportunities in temporal reasoning and enables us to introduce the previously overlooked "temporal positional embedding." Finally, our insights are embodied in "RMem" ("R" for restricted), a simple yet effective VOS modification that excels at challenging VOS scenarios and establishes new state of the art for object state changes (VOST dataset) and long videos (the Long Videos dataset).
Method Overview
- (a) RMem revisits restricting memory banks to enhance VOS, motivated by the insight from our pilot study.
- (b) To maintain an informative memory bank, we balance the relevance and freshness of frames when inserting the latest features.
- (c) Benefiting from the smaller memory-size gap between training and inference, we introduce the previously overlooked temporal positional embedding to explicitly encode the order of frames, which enhances spatio-temporal reasoning (see the sketch below).
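The following minimal PyTorch-style sketch illustrates these ideas under our own simplifying assumptions; it is not the repository's implementation. The class name `RestrictedMemoryBank`, the externally supplied `new_relevance` score, and the eviction rule are illustrative only.

```python
# Illustrative sketch (not the official RMem code) of a restricted memory bank that
# keeps the reference frame, evicts low-relevance frames, and adds a learnable
# temporal positional embedding per memory slot.
import torch
import torch.nn as nn


class RestrictedMemoryBank(nn.Module):
    def __init__(self, capacity: int, feat_dim: int):
        super().__init__()
        assert capacity >= 2, "need room for the reference frame plus recent frames"
        self.capacity = capacity                 # bounded number of memory frames
        self.features: list[torch.Tensor] = []   # per-frame features, each [N, feat_dim]
        self.relevance: list[float] = []         # importance score per stored frame
        # Temporal positional embedding: one learnable vector per memory slot,
        # encoding the order of frames explicitly.
        self.temporal_pe = nn.Parameter(torch.zeros(capacity, feat_dim))

    def update(self, new_feat: torch.Tensor, new_relevance: float) -> None:
        """Insert the latest frame; evict the least relevant non-reference frame if full."""
        self.features.append(new_feat)
        self.relevance.append(new_relevance)
        if len(self.features) > self.capacity:
            # Always keep slot 0 (the reference frame with the ground-truth mask) and the
            # newest frame (freshness); drop the least relevant of the remaining frames.
            victim = min(range(1, len(self.features) - 1), key=lambda i: self.relevance[i])
            self.features.pop(victim)
            self.relevance.pop(victim)

    def read(self) -> torch.Tensor:
        """Stack memory features and add the temporal PE in order of storage."""
        mem = torch.stack(self.features, dim=0)  # [T, N, feat_dim]
        return mem + self.temporal_pe[: mem.shape[0]].unsqueeze(1)
```

Because the bank never grows beyond `capacity`, the memory length seen at inference matches the one seen during training, which is what makes a per-slot temporal positional embedding meaningful.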
Data Preparation
Download the VOST dataset from vostdataset.org and organize the directory structure as follows:
├── aot_plus
│ ├── configs
│ ├── dataloaders
│ ├── datasets
│ │ └── VOST
│ │ ├── Annotations
│ │ ├── ImageSets
│ │ ├── JPEGImages
│ │ └── JPEGImages_10fps
│ ├── docker
│ ├── networks
│ ├── pretrain_models
│ └── tools
├── evaluation
└── README.md
Hint: you can do this with a symbolic link (run from inside `aot_plus/`):
ln -s <your VOST directory> ./datasets/VOST
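As an optional sanity check (a helper of our own, not part of the repository), you can verify that the expected VOST sub-folders are in place before running anything:

```python
# Optional check (run from inside aot_plus/) that the expected VOST folders exist.
from pathlib import Path

root = Path("./datasets/VOST")
for sub in ("Annotations", "ImageSets", "JPEGImages", "JPEGImages_10fps"):
    print(f"{root / sub}: {'ok' if (root / sub).is_dir() else 'MISSING'}")
```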
Checkpoint
Method | $\mathcal{J}_{tr}$ | $\mathcal{J}$ | Checkpoint
---|---|---|---
R50 AOTL | 37.0 | 49.2 | download link
R50 DeAOTL | 37.6 | 51.0 | download link
R50 AOTL + RMem | 39.8 | 50.5 | download link
R50 DeAOTL + RMem | 40.4 | 51.8 | download link
Download the checkpoints and put them in `./aot_plus/pretrain_models/`.
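To confirm a downloaded checkpoint deserializes correctly, a generic PyTorch load works; the file name below is the one referenced in the Evaluation section, and the internal key layout is repository-specific, so we only inspect it:

```python
# Optional sanity check: confirm a downloaded checkpoint loads with torch.load.
import torch

path = "./aot_plus/pretrain_models/aotplus_R50_AOTL_Temp_pe_Slot_4_ema_20000.pth"
ckpt = torch.load(path, map_location="cpu")
print(type(ckpt))
if isinstance(ckpt, dict):
    keys = list(ckpt.keys())
    print(f"{len(keys)} top-level keys, e.g. {keys[:3]}")
```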
Evaluation
First, prepare the PyTorch environment. Follow the instructions on pytorch.org and choose the PyTorch version most compatible with your machine.
Then install the remaining dependencies:
conda install numpy matplotlib scipy scikit-learn tqdm pyyaml pandas
pip install opencv-python
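Once the packages are installed, a short import check (optional, our own suggestion) confirms the environment is usable:

```python
# Optional check that the main dependencies import and a GPU is visible.
import numpy as np
import cv2
import torch

print("PyTorch:", torch.__version__, "| CUDA available:", torch.cuda.is_available())
print("OpenCV:", cv2.__version__, "| NumPy:", np.__version__)
```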
Now you can reproduce the results of our checkpoints:
cd ./aot_plus
./eval_vost.sh
If you want to evaluate AOT, modify `eval_vost.sh`: change `model` to `r50_aotl` and change `ckpt_path` to `aotplus_R50_AOTL_Temp_pe_Slot_4_ema_20000.pth`.
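For reference, $\mathcal{J}$ in the checkpoint table is region similarity, i.e. the Jaccard index (IoU) between predicted and ground-truth masks averaged over the sequence, and $\mathcal{J}_{tr}$ (as defined by the VOST benchmark) computes the same measure only on frames after the object transformation. A minimal per-frame version, for intuition only:

```python
# Per-frame region similarity J: Jaccard index (IoU) between two binary masks.
# For intuition only; use the repository's evaluation code for official scores.
import numpy as np

def jaccard(pred: np.ndarray, gt: np.ndarray) -> float:
    pred, gt = pred.astype(bool), gt.astype(bool)
    union = np.logical_or(pred, gt).sum()
    if union == 0:          # both masks empty: conventionally J = 1
        return 1.0
    return float(np.logical_and(pred, gt).sum() / union)
```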
Training
If you want to train your own model, you can start from the AOT/DeAOT models (pretrained on DAVIS and YouTubeVOS) provided by the official AOT team. The models can be accessed from the MODEL_ZOO:
Method | Pretrained Model
---|---
R50 AOTL | download link
R50 DeAOTL | download link
Then run:
cd ./aot_plus
./train_vost.sh