Awesome
DINO-Tracker: Taming DINO for Self-Supervised Point Tracking in a Single Video (ECCV 2024)
[<a href="https://dino-tracker.github.io/" target="_blank">Project Page</a>] [<a href="https://arxiv.org/abs/2403.14548" target="_blank">arXiv</a>]
https://github.com/AssafSinger94/dino-tracker/assets/43016459/bcd7d57a-1977-4f05-a2f4-ddd859524598
Usage
Setup
Clone the repository:
git clone https://github.com/AssafSinger94/dino-tracker.git
Switch to the project directory:
cd dino-tracker
To setup the environment, run:
conda create -n dino-tracker python=3.9
conda activate dino-tracker
pip install -r requirements.txt
Add current path to PYTHONPATH
:
export PYTHONPATH=`pwd`:$PYTHONPATH
Preprocessing
Given an input video, we start by extracting optical flow and DINO best-buddy correspondences. The input video directory should have the following structure:
├──<VIDEO_DIR>
├──video/
├──00000.png
├──00001.png
├──...
├──masks/ # optional
├──00000.png
├──00001.png
├──...
where masks
contains the per-frame foreground masks. If masks
is not provided, foreground masks are automatically computed using DINO features saliency maps.
In case the video is in mp4 format, convert it to frames by simply running:
python ./preprocessing/mp4_to_frames.py \
--video-path <PATH_TO_MP4> \
--output-folder <VIDEO_DIR_PATH>/video
To run the preprocessing pipeline, run the following:
python ./preprocessing/main_preprocessing.py \
--config ./config/preprocessing.yaml \
--data-path <VIDEO_DIR_PATH>
The script outputs chained optical flow trajectories, DINO embeddings and DINO best-buddies in the following structure:
├──<VIDEO_DIR>
├──video/
├──masks/
├──dino_best_buddies/
├──dino_embeddings/
├──of_trajectories/
Training
Once preprocessing is finished, run the following command to train DINO-Tracker:
python ./train.py \
--config ./config/train.yaml \
--data-path <VIDEO_DIR_PATH>
The checkpoints are saved under:
├──<VIDEO_DIR>
├──models
├──dino_tracker
├──delta_dino_<ITER>.pt
├──tracker_head_<ITER>.pt
Inference
Trajectory creation and visualization
To predict and visualize trajectories with a trained DINO-Tracker, run the following scripts sequentially:
python ./inference_grid.py \
--config ./config/train.yaml \
--data-path <VIDEO_DIR_PATH> \
--use-segm-mask # optional, used for sampling only foreground points
python visualization/visualize_rainbow.py \
--data-path <VIDEO_DIR_PATH> \
--plot-trails # optional, used for visualizing motion trails.
The first script creates trajectories for a grid of query points in the first frame, while the second script visualizes them. The --plot-trails
option is used for visualizing motion trails. Note that this option requires a segmentation mask for the first frame. If --plot-trails
is not provided, the script only visualizes the tracked positions in circles. The visualizations are outputted under <VIDEO_DIR_PATH>/visualizations
directory.
TAP-Vid evaluation
To evaluate on TAP-Vid-DAVIS, please see the following steps. The same steps can be applied for TAP-Vid Kinetics and BADJA datasets.
-
Download benchmark data file
tapvid_davis_data_strided.pkl
from this link, put it under./tapvid/tapvid_davis_data_strided.pkl
, -
Download pre-trained weights and videos from this link under
davis_480.zip
, unzip the folder to./dataset/tapvid-davis/
, -
Extract DINO embeddings for all videos by running the following:
python ./preprocessing/save_dino_embed_video.py \
--config ./config/preprocessing.yaml \
--data-path ./dataset/tapvid-davis/<VIDEO_ID>
The above should be run for all videos in the benchmark, e.g. <VIDEO_ID> = {0, 1, ..., 29}
for DAVIS.
- Predict trajectories on benchmark query points by running the following for all benchmark videos:
python inference_benchmark.py \
--config ./config/train.yaml \
--data-path ./dataset/tapvid-davis/<VIDEO_ID> \
--benchmark-pickle-path ./tapvid/tapvid_davis_data_strided.pkl \
--video-id <VIDEO_ID>
- Evaluate the model accuracy by running the following:
python ./eval/eval_benchmark.py \
--dataset-root-dir ./dataset/tapvid-davis \
--benchmark-pickle-path ./tapvid/tapvid_davis_data_strided.pkl \
--out-file ./tapvid/comp_metrics_davis.csv \
--dataset-type tapvid # tapvid | BADJA
The evaluation should output:
average_pts_within_thresh: 0.8066 | occlusion_acc: 0.8854 | average_jaccard: 0.6528
.
The output CSV file contains all TAP-Vid metrics (position accuracy, occlusion accuracy, Average Jaccard) for all videos.
Citation
@misc{dino_tracker_2024,
author = {Tumanyan, Narek and Singer, Assaf and Bagon, Shai and Dekel, Tali},
title = {DINO-Tracker: Taming DINO for Self-Supervised Point Tracking in a Single Video},
year = {2024},
booktitle = {European Conference on Computer Vision (ECCV)}
}