$\boldsymbol{R^2}$-Tuning

Installation | Dataset | Training | Evaluation | Model Zoo

This repository maintains the official implementation of the paper $\boldsymbol{R^2}$-Tuning: Efficient Image-to-Video Transfer Learning for Video Temporal Grounding by Ye Liu, Jixuan He, Wanhua Li, Junsik Kim, Donglai Wei, Hanspeter Pfister, and Chang Wen Chen.

<p align="center"><img width="850" src=".github/model.jpg"></p>

🔥 News

🔨 Installation

Please refer to the following environment settings that we use. You may install these packages manually if you encounter problems during the automatic installation.

Install from source

1. Clone the repository from GitHub.

   git clone https://github.com/yeliudev/R2-Tuning.git
   cd R2-Tuning

2. Initialize the conda environment.

   conda create -n r2-tuning python=3.12 -y
   conda activate r2-tuning

3. Install dependencies.

   pip install -r requirements.txt
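
After installation, a quick sanity check such as the one below (assuming PyTorch is pulled in via requirements.txt) confirms that the environment works and a GPU is visible:

```python
# Sanity check: verify that PyTorch imports and CUDA is available.
import torch

print(torch.__version__)          # installed PyTorch version
print(torch.cuda.is_available())  # True if a CUDA device is usable
```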

🔖 Dataset

Option 1 [Recommended]: Download pre-extracted features directly from the Hugging Face Hub.

# Prepare datasets in one command
bash tools/prepare_data.sh

Option 2: Reproduce our data pre-processing pipeline.

1. Download videos from the following links and place them into data/{dataset}/videos.

2. Extract and compress video frames at a fixed frame rate.
# For QVHighlights, Ego4D-NLQ, TACoS, and TVSum
python tools/extract_frames.py <path-to-videos>

# For Charades-STA
python tools/extract_frames.py <path-to-videos> --fps 1.0

# For YouTube Highlights
python tools/extract_frames.py <path-to-videos> --anno_path data/youtube/youtube_anno.json
3. Extract features from video frames.
python tools/extract_feat.py <path-to-anno> <path-to-frames>
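
For reference, the clip_b32_* directories below hold CLIP ViT-B/32 embeddings. The following is a minimal sketch of per-frame feature extraction using the Hugging Face transformers CLIP API; it is illustrative only, and the actual pipeline in tools/extract_feat.py may differ:

```python
# Illustrative sketch: encode video frames with CLIP ViT-B/32.
# The repository's own extraction logic lives in tools/extract_feat.py.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

frames = [Image.open("frame_0001.jpg")]  # hypothetical extracted frame
inputs = processor(images=frames, return_tensors="pt")
with torch.no_grad():
    feats = model.get_image_features(**inputs)  # shape: (num_frames, 512)
```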

The prepared dataset should be in the following structure.

R2-Tuning
├── configs
├── datasets
├── models
├── tools
├── data
│   ├── qvhighlights
│   │   ├── frames_224_0.5fps (optional)
│   │   ├── clip_b32_{vid,txt}_k4
│   │   └── qvhighlights_{train,val,test}.jsonl
│   ├── ego4d
│   │   ├── frames_224_0.5fps (optional)
│   │   ├── clip_b32_{vid,txt}_k4
│   │   └── nlq_{train,val}.jsonl
│   ├── charades
│   │   ├── frames_224_1.0fps (optional)
│   │   ├── clip_b32_{vid,txt}_k4
│   │   └── charades_{train,test}.jsonl
│   ├── tacos
│   │   ├── frames_224_0.5fps (optional)
│   │   ├── clip_b32_{vid,txt}_k4
│   │   └── {train,val,test}.jsonl
│   ├── youtube
│   │   ├── frames_224_auto (optional)
│   │   ├── clip_b32_{vid,txt}_k4
│   │   └── youtube_anno.json
│   └── tvsum
│       ├── frames_224_0.5fps (optional)
│       ├── clip_b32_{vid,txt}_k4
│       └── tvsum_anno.json
├── README.md
├── setup.cfg
└── ···
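
Before training, a small check like the one below (using QVHighlights paths taken from the tree above) helps catch missing annotations or features early:

```python
# Verify the expected QVHighlights files from the directory tree above.
import os

root = "data/qvhighlights"
expected = [
    "qvhighlights_train.jsonl",
    "qvhighlights_val.jsonl",
    "qvhighlights_test.jsonl",
    "clip_b32_vid_k4",
    "clip_b32_txt_k4",
]
for name in expected:
    path = os.path.join(root, name)
    print(f"{path}: {'ok' if os.path.exists(path) else 'MISSING'}")
```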

🔮 Training

Use the following commands to train a model with a specified config.

# Single GPU
python tools/launch.py <path-to-config>

# Multiple GPUs on a single node (elastic)
torchrun --nproc_per_node=<num-gpus> tools/launch.py <path-to-config>

# Multiple GPUs on multiple nodes (slurm)
srun <slurm-args> python tools/launch.py <path-to-config>
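For example, training the default QVHighlights model on 4 GPUs of a single node would look like this (the config path is the one listed in the model zoo below):

```
torchrun --nproc_per_node=4 tools/launch.py configs/qvhighlights/r2_tuning_qvhighlights.py
```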

Please refer to the configs folder for detailed settings of each model.

๐Ÿ† Evaluation

Use the following command to test a model and evaluate results.

python tools/launch.py <path-to-config> --checkpoint <path-to-checkpoint> --eval
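
For example, to evaluate the QVHighlights checkpoint after downloading it from the model zoo below (filename as listed there):

```
python tools/launch.py configs/qvhighlights/r2_tuning_qvhighlights.py --checkpoint r2_tuning_qvhighlights-ed516355.pth --eval
```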

For QVHighlights, you may also dump inference outputs on val and test splits.

python tools/launch.py <path-to-config> --checkpoint <path-to-checkpoint> --dump

Then you can pack the hl_{val,test}_submission.jsonl files and submit them to CodaLab.
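For example, assuming both dump files are in the current directory (please check the CodaLab page for the exact packaging requirements):

```
zip -j submission.zip hl_val_submission.jsonl hl_test_submission.jsonl
```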

💻 Single Video Inference

> [!WARNING]
> This feature is only compatible with nncore==0.4.4.

Use the following command to perform moment retrieval using your own videos and queries.

# Make sure you are using the correct version
pip install nncore==0.4.4

python tools/inference.py <path-to-video> <query> [--config <path-to-config> --checkpoint <path-to-checkpoint>]

The checkpoint trained on QVHighlights using this config will be downloaded by default.
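For example, with a hypothetical local video and a free-form query (both are placeholders):

```
python tools/inference.py demo.mp4 "a man is playing the guitar"
```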

📦 Model Zoo

We provide multiple pre-trained models and training logs here. All models were trained on a single NVIDIA A100 80GB GPU and evaluated using the default metrics of each dataset.

<table>
<tr> <th>Dataset</th> <th>Config</th> <th>R1@0.3</th> <th>R1@0.5</th> <th>R1@0.7</th> <th>MR mAP</th> <th>HD mAP</th> <th>Download</th> </tr>
<tr> <td align="center"> <a href="https://arxiv.org/abs/2107.09609">QVHighlights</a> </td> <td align="center"> <a href="https://github.com/yeliudev/R2-Tuning/blob/main/configs/qvhighlights/r2_tuning_qvhighlights.py">Default</a> </td> <td align="center">78.71</td> <td align="center">67.74</td> <td align="center">51.87</td> <td align="center">47.86</td> <td align="center">39.45</td> <td align="center"> <a href="https://huggingface.co/yeliudev/R2-Tuning/resolve/main/checkpoints/r2_tuning_qvhighlights-ed516355.pth">model</a> | <a href="https://huggingface.co/yeliudev/R2-Tuning/resolve/main/checkpoints/r2_tuning_qvhighlights.log">log</a> </td> </tr>
<tr> <td align="center"> <a href="https://arxiv.org/abs/2110.07058">Ego4D-NLQ</a> </td> <td align="center"> <a href="https://github.com/yeliudev/R2-Tuning/blob/main/configs/ego4d/r2_tuning_ego4d.py">Default</a> </td> <td align="center">7.18</td> <td align="center">4.54</td> <td align="center">2.25</td> <td align="center">—</td> <td align="center">—</td> <td align="center"> <a href="https://huggingface.co/yeliudev/R2-Tuning/resolve/main/checkpoints/r2_tuning_ego4d-6a6f5754.pth">model</a> | <a href="https://huggingface.co/yeliudev/R2-Tuning/resolve/main/checkpoints/r2_tuning_ego4d.log">log</a> </td> </tr>
<tr> <td align="center"> <a href="https://arxiv.org/abs/1705.02101">Charades-STA</a> </td> <td align="center"> <a href="https://github.com/yeliudev/R2-Tuning/blob/main/configs/charades/r2_tuning_charades.py">Default</a> </td> <td align="center">70.91</td> <td align="center">60.48</td> <td align="center">38.66</td> <td align="center">—</td> <td align="center">—</td> <td align="center"> <a href="https://huggingface.co/yeliudev/R2-Tuning/resolve/main/checkpoints/r2_tuning_charades-1de02112.pth">model</a> | <a href="https://huggingface.co/yeliudev/R2-Tuning/resolve/main/checkpoints/r2_tuning_charades.log">log</a> </td> </tr>
<tr> <td align="center"> <a href="https://aclanthology.org/Q13-1003/">TACoS</a> </td> <td align="center"> <a href="https://github.com/yeliudev/R2-Tuning/blob/main/configs/tacos/r2_tuning_tacos.py">Default</a> </td> <td align="center">50.96</td> <td align="center">40.69</td> <td align="center">25.69</td> <td align="center">—</td> <td align="center">—</td> <td align="center"> <a href="https://huggingface.co/yeliudev/R2-Tuning/resolve/main/checkpoints/r2_tuning_tacos-81759b55.pth">model</a> | <a href="https://huggingface.co/yeliudev/R2-Tuning/resolve/main/checkpoints/r2_tuning_tacos.log">log</a> </td> </tr>
<tr> <td align="center" rowspan="6"> <a href="https://doi.org/10.1007/978-3-319-10590-1_51">YouTube<br>Highlights</a> </td> <td align="center"> <a href="https://github.com/yeliudev/R2-Tuning/blob/main/configs/youtube/r2_tuning_youtube_dog.py">Dog</a> </td> <td align="center">—</td> <td align="center">—</td> <td align="center">—</td> <td align="center">—</td> <td align="center">74.26</td> <td align="center"> <a href="https://huggingface.co/yeliudev/R2-Tuning/resolve/main/checkpoints/r2_tuning_youtube_dog-702bd293.pth">model</a> | <a href="https://huggingface.co/yeliudev/R2-Tuning/resolve/main/checkpoints/r2_tuning_youtube_dog.log">log</a> </td> </tr>
<tr> <td align="center"> <a href="https://github.com/yeliudev/R2-Tuning/blob/main/configs/youtube/r2_tuning_youtube_gym.py">Gymnastics</a> </td> <td align="center">—</td> <td align="center">—</td> <td align="center">—</td> <td align="center">—</td> <td align="center">72.07</td> <td align="center"> <a href="https://huggingface.co/yeliudev/R2-Tuning/resolve/main/checkpoints/r2_tuning_youtube_gym-ff68b1b3.pth">model</a> | <a href="https://huggingface.co/yeliudev/R2-Tuning/resolve/main/checkpoints/r2_tuning_youtube_gym.log">log</a> </td> </tr>
<tr> <td align="center"> <a href="https://github.com/yeliudev/R2-Tuning/blob/main/configs/youtube/r2_tuning_youtube_par.py">Parkour</a> </td> <td align="center">—</td> <td align="center">—</td> <td align="center">—</td> <td align="center">—</td> <td align="center">81.02</td> <td align="center"> <a href="https://huggingface.co/yeliudev/R2-Tuning/resolve/main/checkpoints/r2_tuning_youtube_par-27442af0.pth">model</a> | <a href="https://huggingface.co/yeliudev/R2-Tuning/resolve/main/checkpoints/r2_tuning_youtube_par.log">log</a> </td> </tr>
<tr> <td align="center"> <a href="https://github.com/yeliudev/R2-Tuning/blob/main/configs/youtube/r2_tuning_youtube_ska.py">Skating</a> </td> <td align="center">—</td> <td align="center">—</td> <td align="center">—</td> <td align="center">—</td> <td align="center">76.26</td> <td align="center"> <a href="https://huggingface.co/yeliudev/R2-Tuning/resolve/main/checkpoints/r2_tuning_youtube_ska-dad28398.pth">model</a> | <a href="https://huggingface.co/yeliudev/R2-Tuning/resolve/main/checkpoints/r2_tuning_youtube_ska.log">log</a> </td> </tr>
<tr> <td align="center"> <a href="https://github.com/yeliudev/R2-Tuning/blob/main/configs/youtube/r2_tuning_youtube_ski.py">Skiing</a> </td> <td align="center">—</td> <td align="center">—</td> <td align="center">—</td> <td align="center">—</td> <td align="center">74.36</td> <td align="center"> <a href="https://huggingface.co/yeliudev/R2-Tuning/resolve/main/checkpoints/r2_tuning_youtube_ski-df2edc4c.pth">model</a> | <a href="https://huggingface.co/yeliudev/R2-Tuning/resolve/main/checkpoints/r2_tuning_youtube_ski.log">log</a> </td> </tr>
<tr> <td align="center"> <a href="https://github.com/yeliudev/R2-Tuning/blob/main/configs/youtube/r2_tuning_youtube_sur.py">Surfing</a> </td> <td align="center">—</td> <td align="center">—</td> <td align="center">—</td> <td align="center">—</td> <td align="center">82.76</td> <td align="center"> <a href="https://huggingface.co/yeliudev/R2-Tuning/resolve/main/checkpoints/r2_tuning_youtube_sur-d384d8b2.pth">model</a> | <a href="https://huggingface.co/yeliudev/R2-Tuning/resolve/main/checkpoints/r2_tuning_youtube_sur.log">log</a> </td> </tr>
<tr> <td align="center" rowspan="10"> <a href="https://doi.org/10.1109/cvpr.2015.7299154">TVSum</a> </td> <td align="center"> <a href="https://github.com/yeliudev/R2-Tuning/blob/main/configs/tvsum/r2_tuning_tvsum_bk.py">BK</a> </td> <td align="center">—</td> <td align="center">—</td> <td align="center">—</td> <td align="center">—</td> <td align="center">91.23</td> <td align="center"> <a href="https://huggingface.co/yeliudev/R2-Tuning/resolve/main/checkpoints/r2_tuning_tvsum_bk-45b59440.pth">model</a> | <a href="https://huggingface.co/yeliudev/R2-Tuning/resolve/main/checkpoints/r2_tuning_tvsum_bk.log">log</a> </td> </tr>
<tr> <td align="center"> <a href="https://github.com/yeliudev/R2-Tuning/blob/main/configs/tvsum/r2_tuning_tvsum_bt.py">BT</a> </td> <td align="center">—</td> <td align="center">—</td> <td align="center">—</td> <td align="center">—</td> <td align="center">92.35</td> <td align="center"> <a href="https://huggingface.co/yeliudev/R2-Tuning/resolve/main/checkpoints/r2_tuning_tvsum_bt-13683fdf.pth">model</a> | <a href="https://huggingface.co/yeliudev/R2-Tuning/resolve/main/checkpoints/r2_tuning_tvsum_bt.log">log</a> </td> </tr>
<tr> <td align="center"> <a href="https://github.com/yeliudev/R2-Tuning/blob/main/configs/tvsum/r2_tuning_tvsum_ds.py">DS</a> </td> <td align="center">—</td> <td align="center">—</td> <td align="center">—</td> <td align="center">—</td> <td align="center">80.88</td> <td align="center"> <a href="https://huggingface.co/yeliudev/R2-Tuning/resolve/main/checkpoints/r2_tuning_tvsum_ds-d11b33d3.pth">model</a> | <a href="https://huggingface.co/yeliudev/R2-Tuning/resolve/main/checkpoints/r2_tuning_tvsum_ds.log">log</a> </td> </tr>
<tr> <td align="center"> <a href="https://github.com/yeliudev/R2-Tuning/blob/main/configs/tvsum/r2_tuning_tvsum_fm.py">FM</a> </td> <td align="center">—</td> <td align="center">—</td> <td align="center">—</td> <td align="center">—</td> <td align="center">75.61</td> <td align="center"> <a href="https://huggingface.co/yeliudev/R2-Tuning/resolve/main/checkpoints/r2_tuning_tvsum_fm-2c8c119d.pth">model</a> | <a href="https://huggingface.co/yeliudev/R2-Tuning/resolve/main/checkpoints/r2_tuning_tvsum_fm.log">log</a> </td> </tr>
<tr> <td align="center"> <a href="https://github.com/yeliudev/R2-Tuning/blob/main/configs/tvsum/r2_tuning_tvsum_ga.py">GA</a> </td> <td align="center">—</td> <td align="center">—</td> <td align="center">—</td> <td align="center">—</td> <td align="center">89.51</td> <td align="center"> <a href="https://huggingface.co/yeliudev/R2-Tuning/resolve/main/checkpoints/r2_tuning_tvsum_ga-58e79858.pth">model</a> | <a href="https://huggingface.co/yeliudev/R2-Tuning/resolve/main/checkpoints/r2_tuning_tvsum_ga.log">log</a> </td> </tr>
<tr> <td align="center"> <a href="https://github.com/yeliudev/R2-Tuning/blob/main/configs/tvsum/r2_tuning_tvsum_ms.py">MS</a> </td> <td align="center">—</td> <td align="center">—</td> <td align="center">—</td> <td align="center">—</td> <td align="center">85.01</td> <td align="center"> <a href="https://huggingface.co/yeliudev/R2-Tuning/resolve/main/checkpoints/r2_tuning_tvsum_ms-d9b4f8fa.pth">model</a> | <a href="https://huggingface.co/yeliudev/R2-Tuning/resolve/main/checkpoints/r2_tuning_tvsum_ms.log">log</a> </td> </tr>
<tr> <td align="center"> <a href="https://github.com/yeliudev/R2-Tuning/blob/main/configs/tvsum/r2_tuning_tvsum_pk.py">PK</a> </td> <td align="center">—</td> <td align="center">—</td> <td align="center">—</td> <td align="center">—</td> <td align="center">82.82</td> <td align="center"> <a href="https://huggingface.co/yeliudev/R2-Tuning/resolve/main/checkpoints/r2_tuning_tvsum_pk-1830ce03.pth">model</a> | <a href="https://huggingface.co/yeliudev/R2-Tuning/resolve/main/checkpoints/r2_tuning_tvsum_pk.log">log</a> </td> </tr>
<tr> <td align="center"> <a href="https://github.com/yeliudev/R2-Tuning/blob/main/configs/tvsum/r2_tuning_tvsum_pr.py">PR</a> </td> <td align="center">—</td> <td align="center">—</td> <td align="center">—</td> <td align="center">—</td> <td align="center">90.39</td> <td align="center"> <a href="https://huggingface.co/yeliudev/R2-Tuning/resolve/main/checkpoints/r2_tuning_tvsum_pr-51d78fc9.pth">model</a> | <a href="https://huggingface.co/yeliudev/R2-Tuning/resolve/main/checkpoints/r2_tuning_tvsum_pr.log">log</a> </td> </tr>
<tr> <td align="center"> <a href="https://github.com/yeliudev/R2-Tuning/blob/main/configs/tvsum/r2_tuning_tvsum_vt.py">VT</a> </td> <td align="center">—</td> <td align="center">—</td> <td align="center">—</td> <td align="center">—</td> <td align="center">89.81</td> <td align="center"> <a href="https://huggingface.co/yeliudev/R2-Tuning/resolve/main/checkpoints/r2_tuning_tvsum_vt-0069d8d3.pth">model</a> | <a href="https://huggingface.co/yeliudev/R2-Tuning/resolve/main/checkpoints/r2_tuning_tvsum_vt.log">log</a> </td> </tr>
<tr> <td align="center"> <a href="https://github.com/yeliudev/R2-Tuning/blob/main/configs/tvsum/r2_tuning_tvsum_vu.py">VU</a> </td> <td align="center">—</td> <td align="center">—</td> <td align="center">—</td> <td align="center">—</td> <td align="center">85.90</td> <td align="center"> <a href="https://huggingface.co/yeliudev/R2-Tuning/resolve/main/checkpoints/r2_tuning_tvsum_vu-6f8ebb4b.pth">model</a> | <a href="https://huggingface.co/yeliudev/R2-Tuning/resolve/main/checkpoints/r2_tuning_tvsum_vu.log">log</a> </td> </tr>
</table>

📖 Citation

Please kindly cite our paper if you find this project helpful.

@inproceedings{liu2024tuning,
  title={$R^2$-Tuning: Efficient Image-to-Video Transfer Learning for Video Temporal Grounding},
  author={Liu, Ye and He, Jixuan and Li, Wanhua and Kim, Junsik and Wei, Donglai and Pfister, Hanspeter and Chen, Chang Wen},
  booktitle={Proceedings of the European Conference on Computer Vision (ECCV)},
  year={2024}
}