Classification-Then-Grounding: Reformulating Video Scene Graphs as Temporal Bipartite Graphs

PyTorch implementation of our paper Classification-Then-Grounding: Reformulating Video Scene Graphs as Temporal Bipartite Graphs, accepted by CVPR 2022. (arXiv)

(Figure: cvpr2022-6703.png)

We also won 1st place in the Video Relation Understanding (VRU) Grand Challenge at ACM Multimedia 2021 with a simplified version of our model. (The code for object tracklet generation is available here.)

Requirements

Python 3.7 or later and PyTorch 1.6 or later. For other basic packages, simply run the project and install whatever is missing.
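
As a quick, optional environment check (a minimal sketch, not part of the repo; it only assumes Python and PyTorch are installed):

    import sys
    import torch

    # Versions mentioned above: Python 3.7+ and PyTorch 1.6+.
    assert sys.version_info >= (3, 7), "Python 3.7 or later is required"
    print("Python :", sys.version.split()[0])
    print("PyTorch:", torch.__version__, "| CUDA available:", torch.cuda.is_available())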

Datasets

Download the ImageNet-VidVRD dataset and the VidOR dataset, and organize them in the following folder structure:

├── dataloaders
│   ├── dataloader_vidvrd.py
│   └── ...
├── datasets
│   ├── cache                       # cache file for our dataloaders
│   ├── vidvrd-dataset
│   │   ├── train
│   │   ├── test
│   │   └── videos
│   ├── vidor-dataset
│   │   ├── annotation
│   │   └── videos
│   └── GT_json_for_eval
│       ├── VidORval_gts.json       # GT JSON for evaluation
│       └── VidVRDtest_gts.json
├── tracking_results                # tracklets data & features
│   ├── ...
│   ├── format_demo.py              
│   └── readme.md   
├── experiments   
├── models
├── ...
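
Before going further, a small sanity check like the following (a hypothetical sketch, not part of the repo) can confirm that the layout matches the tree above:

    import os

    # Directories from the tree above; GT_json_for_eval is filled later by
    # prepare_gts_for_eval.py, but the folder itself should exist.
    expected_dirs = [
        "datasets/vidvrd-dataset/train",
        "datasets/vidvrd-dataset/test",
        "datasets/vidvrd-dataset/videos",
        "datasets/vidor-dataset/annotation",
        "datasets/vidor-dataset/videos",
        "datasets/GT_json_for_eval",
        "tracking_results",
        "experiments",
    ]
    for d in expected_dirs:
        print(("OK       " if os.path.isdir(d) else "MISSING  ") + d)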

Verify tracklet data & feature preparation by running dataloader_demo

This section helps you download the tracklet data, place it correctly, and set the dataloader's config properly. Run tools/dataloader_demo.py successfully to verify that all data & configs are set correctly.

NOTE: we use the term proposal in our code to refer to video-level tracklet proposals, which is totally different from the concept of "proposal" in the "proposal-based methods" discussed in our paper. In the paper, "proposals" are paired subject-object tracklet segments; in contrast, the term proposal in our code means a long-term, video-level object tracklet (i.e., without sliding windows or video segments).

Tracklet data for VidVRD

  1. Download the tracklets with features here: train, test, and put them in tracking_results/. Refer to tracking_results/readme.md for more details about the tracklet data.

  2. Download the tracklets with features used in "Beyond Short-Term Snippet: Video Relation Detection with Spatio-Temporal Global Context" from the author's personal page here.

    Some Notes

    • We use the term pku (i.e., Peking University) in our code to refer to their tracklets & features.
    • The data they originally released only has 999 .npy files (they may have updated the link by now), missing the data for video ILSVRC2015_train_00884000. So we trained our own Faster-RCNN (with the same training setting as in the above paper) and extracted the tracklets & features; the supplemental data can be found here.
  3. The tracklets with features are in VidVRD_test_every1frames (ours), VidVRD_train_every1frames (ours), and preprocess_data/tracking/videovrd_detect_tracking (PKU, both train & test), in which each .npy file corresponds to a video and contains all the tracklets in that video. The I3D features of the tracklets are in preprocess_data/tracking/videovrd_i3d (PKU, both train & test). Put them under the dir of this project (or anywhere else if you use absolute paths). A quick inspection sketch for these .npy files is shown at the end of this list.

  4. Modify the config file at experiments/demo/config_.py, where proposal_dir is the dir of the tracklets with features, i3d_dir is the dir of the tracklets' I3D features, and ann_dir is datasets/vidvrd-dataset.

  5. Verify that all data & configs are set correctly, e.g., for PKU's tracklets with I3D features, run the following commands (refer to tools/dataloader_demo.py for more details):

    python tools/dataloader_demo.py \
            --cfg_path experiments/demo/config_.py \
            --split test \
            --dataset_class pku_i3d
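
Independent of the dataloader, you can take a quick look at a single tracklet file. The per-video format is documented in tracking_results/readme.md and format_demo.py; the sketch below only assumes each .npy is loadable with allow_pickle=True, and the file name is a placeholder:

    import numpy as np

    # Placeholder path: pick any .npy under your proposal_dir.
    path = "VidVRD_test_every1frames/some_video.npy"
    data = np.load(path, allow_pickle=True)
    print(type(data), getattr(data, "shape", None), getattr(data, "dtype", None))

    # If the array wraps Python objects, unwrap one element to see its structure;
    # refer to tracking_results/readme.md for the authoritative format.
    if isinstance(data, np.ndarray) and data.dtype == object:
        item = data.item() if data.shape == () else data[0]
        print(type(item))
        if isinstance(item, dict):
            print(list(item.keys()))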
    

Tracklet data for VidOR

Evaluation

First, make sure you have run tools/dataloader_demo.py successfully.

  1. First, generate the GT JSON files for evaluation (a small sanity-check sketch for the generated files is shown after this list):

    for vidvrd:

    python VidVRD-helper/prepare_gts_for_eval.py \
        --dataset_type vidvrd \
        --save_path datasets/GT_json_for_eval/VidVRDtest_gts.json
    

    for vidor:

    python VidVRD-helper/prepare_gts_for_eval.py \
        --dataset_type vidor \
        --save_path datasets/GT_json_for_eval/VidORval_gts.json
    
  2. Download model weights for different exps here, and put them in the experiments/ dir. Download pre-prepared data here, and put them in the prepared_data/ dir.

  3. Refer to experiments/readme.md for the correspondence between the exp ids and the table ids in our paper.

  4. For VidVRD, run the following commands to evaluate different exps (refer to tools/eval_vidvrd.py for more details):

    e.g., for exp1

    python tools/eval_vidvrd.py \
        --cfg_path experiments/exp1/config_.py \
        --ckpt_path experiments/exp1/model_epoch_80.pth \
        --use_pku \
        --cuda 1 \
        --save_tag debug
    
  5. For VidOR, refer to tools/eval_vidor.py for more details.

    Run the following commands to evaluate BIG-C (i.e., only the classification stage):

    python tools/eval_vidor.py \
        --eval_cls_only \
        --cfg_path experiments/exp4/config_.py \
        --ckpt_path experiments/exp4/model_epoch_60.pth \
        --save_tag epoch60_debug \
        --cuda 1
    

    Run the following commands to evaluate BIG based on the output of the classification stage (you need to run BIG-C first and save the infer_results).

    python tools/eval_vidor.py \
        --cfg_path experiments/grounding_weights/config_.py \
        --ckpt_path experiments/grounding_weights/model_epoch_70.pth \
        --output_dir experiments/exp4_with_grounding \
        --cls_stage_result_path experiments/exp4/VidORval_infer_results_topk3_epoch60_debug.pkl \
        --save_tag with_grd_epoch70 \
        --cuda 1
    

    Run the following commands to evaluate the fraction recall (refer to Table 6 in our paper; you need to run BIG first and save the hit_infos).

    python tools/eval_fraction_recall.py \
        --cfg_path experiments/grounding_weights/config_.py \
        --hit_info_path  experiments/exp5_with_grounding/VidORval_hit_infos_aft_grd_with_grd_epoch70.pkl
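
As a quick sanity check of the evaluation inputs from step 1, you can load the generated GT JSON and inspect its top-level structure (a hypothetical sketch; the internal schema is whatever VidVRD-helper/prepare_gts_for_eval.py produced):

    import json

    # VidVRD GT file generated in step 1; swap in VidORval_gts.json for VidOR.
    with open("datasets/GT_json_for_eval/VidVRDtest_gts.json") as f:
        gts = json.load(f)

    print(type(gts))
    if isinstance(gts, dict):
        print("num top-level entries:", len(gts))
        print("example keys:", list(gts)[:3])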
    

Training

  1. For VidVRD, run the following commands to train different exps (refer to tools/train_vidvrd.py for more details):

    e.g., for exp1

    CUDA_VISIBLE_DEVICES=0,1 python tools/train_vidvrd.py \
        --cfg_path experiments/exp1/config_.py \
        --use_pku \
        --save_tag retrain
    
  2. For VidOR, refer to tools/train_vidor.py for more details.

    Run the following commands to train BIG-C (i.e., only the classification stage), e.g., for exp4:

    CUDA_VISIBLE_DEVICES=0,1 python tools/train_vidor.py \
        --cfg_path experiments/exp4/config_.py \
        --save_tag retrain
    

    Note that we pre-assign all the labels for Base-C (exp6) since it does not require bipartite matching between predicates and GTs. The label assignment takes around 1.5 hours. (A generic sketch of such bipartite matching is shown after this list.)

    Run the following commands to train the grounding stage:

    CUDA_VISIBLE_DEVICES=2,3 python tools/train_vidor.py \
        --train_grounding \
        --cfg_path experiments/grounding_weights/config_.py \
        --save_tag retrain
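
For reference, the bipartite matching mentioned in the note above is a standard Hungarian assignment between predicted predicates and GT relations. The sketch below is generic (the cost matrix is a random placeholder, not the cost actually used in this repo):

    import torch
    from scipy.optimize import linear_sum_assignment

    num_preds, num_gts = 8, 3
    # Placeholder cost: in practice this would combine classification and
    # temporal/grounding terms; see the training code for the real definition.
    cost = torch.rand(num_preds, num_gts)
    pred_idx, gt_idx = linear_sum_assignment(cost.numpy())
    print(list(zip(pred_idx.tolist(), gt_idx.tolist())))  # matched (prediction, GT) pairs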
    

Data Release Summary

(dcd is just the name of our lab's MEGA cloud account :) )

Citation

If our work is helpful for your research, please cite our publication:

@inproceedings{gao2021classification,
  title={Classification-Then-Grounding: Reformulating Video Scene Graphs as Temporal Bipartite Graphs},
  author={Gao, Kaifeng and Chen, Long and Niu, Yulei and Shao, Jian and Xiao, Jun},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  year={2022}
}

Others

I have been working on this project for more than a year. While learning and using PyTorch, I wrote a lot of utility APIs (utils/utils_func.py), some of which might be interesting and useful, e.g., unique_with_idx_nd. So I opened a new repo to collect them here.
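
For illustration only (the actual implementation lives in utils/utils_func.py and may differ), a function like unique_with_idx_nd presumably returns the unique rows of a tensor together with the indices at which each unique row occurs, e.g.:

    import torch

    def unique_with_idx_nd(x: torch.Tensor):
        """Guess at the behavior: unique rows of `x` plus, for each unique row,
        the indices where it appears. Check utils/utils_func.py for the real API."""
        uniq, inverse = torch.unique(x, dim=0, return_inverse=True)
        idx_groups = [torch.nonzero(inverse == i, as_tuple=False).squeeze(1)
                      for i in range(len(uniq))]
        return uniq, idx_groups

    x = torch.tensor([[1, 2], [3, 4], [1, 2]])
    uniq, groups = unique_with_idx_nd(x)
    print(uniq)    # tensor([[1, 2], [3, 4]])
    print(groups)  # [tensor([0, 2]), tensor([1])]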