RMem: Restricted Memory Banks Improve Video Object Segmentation
Junbao Zhou, Ziqi Pang, Yu-Xiong Wang
University of Illinois Urbana-Champaign
Abstract
With recent video object segmentation (VOS) benchmarks evolving to challenging scenarios, we revisit a simple but overlooked strategy: restricting the size of memory banks. This diverges from the prevalent practice of expanding memory banks to accommodate extensive historical information. Our specially designed "memory deciphering" study offers a pivotal insight underpinning such a strategy: expanding memory banks, while seemingly beneficial, actually increases the difficulty for VOS modules to decode relevant features due to the confusion from redundant information. By restricting memory banks to a limited number of essential frames, we achieve a notable improvement in VOS accuracy. This process balances the importance and freshness of frames to maintain an informative memory bank within a bounded capacity. Additionally, restricted memory banks reduce the training-inference discrepancy in memory lengths compared with continuous expansion. This fosters new opportunities in temporal reasoning and enables us to introduce the previously overlooked "temporal positional embedding." Finally, our insights are embodied in "RMem" ("R" for restricted), a simple yet effective VOS modification that excels at challenging VOS scenarios and establishes new state of the art for object state changes (VOST dataset) and long videos (the Long Videos dataset).
Method Overview
- (a) RMem revisits restricting memory banks to enhance VOS, motivated by the insight from our pilot study.
- (b) To maintain an informative memory bank, we balance the relevance and freshness of frames when inserting the latest features.
- (c) Benefiting from the smaller memory-size gap between training and inference, we introduce the previously overlooked temporal positional embedding to explicitly encode the order of frames, which enhances spatio-temporal reasoning (see the sketch below).
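The following minimal PyTorch-style sketch illustrates these ideas under our own simplifying assumptions; it is not the repository's implementation. The class name `RestrictedMemoryBank`, the externally supplied `new_relevance` score, and the eviction rule are illustrative only.

```python
# Illustrative sketch (not the official RMem code) of a restricted memory bank that
# keeps the reference frame, evicts low-relevance frames, and adds a learnable
# temporal positional embedding per memory slot.
import torch
import torch.nn as nn


class RestrictedMemoryBank(nn.Module):
    def __init__(self, capacity: int, feat_dim: int):
        super().__init__()
        assert capacity >= 2, "need room for the reference frame plus recent frames"
        self.capacity = capacity                 # bounded number of memory frames
        self.features: list[torch.Tensor] = []   # per-frame features, each [N, feat_dim]
        self.relevance: list[float] = []         # importance score per stored frame
        # Temporal positional embedding: one learnable vector per memory slot,
        # encoding the order of frames explicitly.
        self.temporal_pe = nn.Parameter(torch.zeros(capacity, feat_dim))

    def update(self, new_feat: torch.Tensor, new_relevance: float) -> None:
        """Insert the latest frame; evict the least relevant non-reference frame if full."""
        self.features.append(new_feat)
        self.relevance.append(new_relevance)
        if len(self.features) > self.capacity:
            # Always keep slot 0 (the reference frame with the ground-truth mask) and the
            # newest frame (freshness); drop the least relevant of the remaining frames.
            victim = min(range(1, len(self.features) - 1), key=lambda i: self.relevance[i])
            self.features.pop(victim)
            self.relevance.pop(victim)

    def read(self) -> torch.Tensor:
        """Stack memory features and add the temporal PE in order of storage."""
        mem = torch.stack(self.features, dim=0)  # [T, N, feat_dim]
        return mem + self.temporal_pe[: mem.shape[0]].unsqueeze(1)
```

Because the bank never grows beyond `capacity`, the memory length seen at inference matches the one seen during training, which is what makes a per-slot temporal positional embedding meaningful.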
Data Preparation
Download the VOST dataset from vostdataset.org and organize the directory structure as follows:
├── aot_plus
│ ├── configs
│ ├── dataloaders
│ ├── datasets
│ │ └── VOST
│ │ ├── Annotations
│ │ ├── ImageSets
│ │ ├── JPEGImages
│ │ └── JPEGImages_10fps
│ ├── docker
│ ├── networks
│ ├── pretrain_models
│ └── tools
├── evaluation
└── README.md
Hint: you can do this with a symbolic link (run from inside `aot_plus/`):
ln -s <your VOST directory> ./datasets/VOST
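As an optional sanity check (a helper of our own, not part of the repository), you can verify that the expected VOST sub-folders are in place before running anything:

```python
# Optional check (run from inside aot_plus/) that the expected VOST folders exist.
from pathlib import Path

root = Path("./datasets/VOST")
for sub in ("Annotations", "ImageSets", "JPEGImages", "JPEGImages_10fps"):
    print(f"{root / sub}: {'ok' if (root / sub).is_dir() else 'MISSING'}")
```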
Checkpoint
Method | $\mathcal{J}_{tr}$ | $\mathcal{J}$ | Checkpoint
---|---|---|---
R50 AOTL | 37.0 | 49.2 | download link
R50 DeAOTL | 37.6 | 51.0 | download link
R50 AOTL + RMem | 39.8 | 50.5 | download link
R50 DeAOTL + RMem | 40.4 | 51.8 | download link
Download the checkpoints and put them in `./aot_plus/pretrain_models/`.
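To confirm a downloaded checkpoint deserializes correctly, a generic PyTorch load works; the file name below is the one referenced in the Evaluation section, and the internal key layout is repository-specific, so we only inspect it:

```python
# Optional sanity check: confirm a downloaded checkpoint loads with torch.load.
import torch

path = "./aot_plus/pretrain_models/aotplus_R50_AOTL_Temp_pe_Slot_4_ema_20000.pth"
ckpt = torch.load(path, map_location="cpu")
print(type(ckpt))
if isinstance(ckpt, dict):
    keys = list(ckpt.keys())
    print(f"{len(keys)} top-level keys, e.g. {keys[:3]}")
```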
Evaluation
First, prepare the PyTorch environment. Follow the instructions on pytorch.org and choose the PyTorch version most compatible with your machine.
Then install the remaining dependencies:
conda install numpy matplotlib scipy scikit-learn tqdm pyyaml pandas
pip install opencv-python
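Once the packages are installed, a short import check (optional, our own suggestion) confirms the environment is usable:

```python
# Optional check that the main dependencies import and a GPU is visible.
import numpy as np
import cv2
import torch

print("PyTorch:", torch.__version__, "| CUDA available:", torch.cuda.is_available())
print("OpenCV:", cv2.__version__, "| NumPy:", np.__version__)
```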
Now you can reproduce the results of our checkpoints:
cd ./aot_plus
./eval_vost.sh
If you want to evaluate AOT, modify `eval_vost.sh`: change `model` to `r50_aotl` and change `ckpt_path` to `aotplus_R50_AOTL_Temp_pe_Slot_4_ema_20000.pth`.
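For reference, $\mathcal{J}$ in the checkpoint table is region similarity, i.e. the Jaccard index (IoU) between predicted and ground-truth masks averaged over the sequence, and $\mathcal{J}_{tr}$ (as defined by the VOST benchmark) computes the same measure only on frames after the object transformation. A minimal per-frame version, for intuition only:

```python
# Per-frame region similarity J: Jaccard index (IoU) between two binary masks.
# For intuition only; use the repository's evaluation code for official scores.
import numpy as np

def jaccard(pred: np.ndarray, gt: np.ndarray) -> float:
    pred, gt = pred.astype(bool), gt.astype(bool)
    union = np.logical_or(pred, gt).sum()
    if union == 0:          # both masks empty: conventionally J = 1
        return 1.0
    return float(np.logical_and(pred, gt).sum() / union)
```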
Training
If you want to train your own model, you can start from the AOT/DeAOT models (pretrained on DAVIS and YouTubeVOS) provided by the official AOT team. The models can be accessed from the MODEL_ZOO:
Method | Pretrained Model
---|---
R50 AOTL | download link
R50 DeAOTL | download link
Then run:
cd ./aot_plus
./train_vost.sh