VL-PET

<p align="center"> <a href="https://henryhzy.github.io/VL-PET/"><img src="images/logo.svg" width="30%"></a> </p> <p align="center"> <a href="https://henryhzy.github.io/VL-PET/"><img src="images/ICCV2023_poster.jpg" width="100%"></a> </p>

Official code for "VL-PET: Vision-and-Language Parameter-Efficient Tuning via Granularity Control" (ICCV2023)

Authors: Zi-Yuan Hu<sup>1,3</sup>, Yanyang Li<sup>1</sup>, Michael R. Lyu<sup>1</sup> and Liwei Wang<sup>*1,2</sup> (<sup>*</sup>Corresponding Author)

<strong> <sup>1</sup>The Chinese University of Hong Kong<br> <sup>2</sup>Centre for Perceptual and Interactive Intelligence<br> <sup>3</sup>Shanghai AI Laboratory<br> </strong> <br> <a href='https://henryhzy.github.io/VL-PET'><img src='https://img.shields.io/badge/Project-Page-Green'></a> <a href='https://arxiv.org/abs/2308.09804'><img src='https://img.shields.io/badge/Paper-arXiv-red'></a> <a href='https://arxiv.org/pdf/2308.09804.pdf'><img src='https://img.shields.io/badge/Paper-PDF-red'></a> <a href='ICCV2023_poster.pdf'><img src='https://img.shields.io/badge/Poster-PDF-red'></a> <a href='https://opensource.org/licenses/MIT'><img src='https://img.shields.io/badge/License-MIT-yellow.svg'></a> <a href="https://hits.seeyoufarm.com"><img src="https://hits.seeyoufarm.com/api/count/incr/badge.svg?url=https%3A%2F%2Fgithub.com%2FHenryHZY%2FVL-PET&count_bg=%2379C83D&title_bg=%23555555&icon=&icon_color=%23E7E7E7&title=views&edge_flat=false"/></a> <br> Project page (with more details and a fun fact about our logo): [VL-PET](https://henryhzy.github.io/VL-PET)

Abstract

As the model size of pre-trained language models (PLMs) grows rapidly, full fine-tuning becomes prohibitively expensive for model training and storage. In vision-and-language (VL), parameter-efficient tuning (PET) techniques are proposed to integrate modular modifications (e.g., Adapter and LoRA) into encoder-decoder PLMs. By tuning a small set of trainable parameters, these techniques perform on par with full fine-tuning. However, excessive modular modifications and neglecting the functionality gap between the encoders and decoders can lead to performance degradation, while existing PET techniques (e.g., VL-Adapter) overlook these critical issues.

In this paper, we propose a Vision-and-Language Parameter-Efficient Tuning (VL-PET) framework to impose effective control over modular modifications via a novel granularity-controlled mechanism. Considering different granularity-controlled matrices generated by this mechanism, a variety of model-agnostic VL-PET modules can be instantiated from our framework for better efficiency and effectiveness trade-offs. We further propose lightweight PET module designs to enhance VL alignment and modeling for the encoders and maintain text generation for the decoders.

Extensive experiments conducted on four image-text tasks and four video-text tasks demonstrate the efficiency, effectiveness and transferability of our VL-PET framework. In particular, our VL-PET-large with lightweight PET module designs significantly outperforms VL-Adapter by 2.92% (3.41%) and LoRA by 3.37% (7.03%) with BART-base (T5-base) on image-text tasks. Furthermore, we validate the enhanced effect of employing our VL-PET designs on existing PET techniques, enabling them to achieve significant performance improvements.

VL-PET Framework

<p align="center"> <img src="images/model.png" width="60%"><br> </p> <p align="center"> <img src="images/framework.png" width="100%"> </p>
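
To make the figures above more concrete, here is a rough, self-contained sketch of what a granularity-controlled PET module boils down to: an adapter-style bottleneck whose output is modulated element-wise by a granularity-controlled matrix before being added back to the hidden states. This is our simplified illustration, not the implementation in `src/`; the class name, shapes and the way the control matrix is generated are assumptions, and the actual VL-PET-small/middleX/middleY/large instantiations differ in the granularity they use.

```python
# Illustrative sketch only (NOT the official implementation in src/):
# an adapter-style modification gated by a granularity-controlled matrix.
import torch
import torch.nn as nn


class GranularityControlledAdapter(nn.Module):
    def __init__(self, d_model: int, bottleneck: int, granularity: str = "dim"):
        super().__init__()
        self.down = nn.Linear(d_model, bottleneck)
        self.up = nn.Linear(bottleneck, d_model)
        self.act = nn.ReLU()
        # Coarser granularity means fewer trainable control parameters.
        if granularity == "scalar":    # one gate shared by the whole modification
            self.gate = nn.Parameter(torch.ones(1))
        elif granularity == "dim":     # one gate per hidden dimension
            self.gate = nn.Parameter(torch.ones(d_model))
        else:
            raise ValueError(f"unknown granularity: {granularity}")

    def forward(self, hidden: torch.Tensor) -> torch.Tensor:
        # hidden: (batch, seq_len, d_model)
        delta = self.up(self.act(self.down(hidden)))  # modular modification
        return hidden + self.gate * delta             # granularity-controlled update


# Example with a BART-base-sized hidden state (d_model = 768).
x = torch.randn(2, 16, 768)
module = GranularityControlledAdapter(d_model=768, bottleneck=96, granularity="dim")
print(module(x).shape)  # torch.Size([2, 16, 768])
```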

Experiments

<p align="center"> <img src="images/relative.png" width="60%"><br> </p> <p align="center"> <img src="images/result.png" width="100%"> </p> <p align="center"> <img src="images/apply.png" width="80%"> </p>

Quick Start

1. Installation

conda create -n vlpet python=3.8
conda activate vlpet
pip install -r requirements.txt
python -c "import language_evaluation; language_evaluation.download('coco')"
<details> <summary>Click for more details... </summary>

More details about the installation:

GPU: A100 (80GB)
Driver Version: 470.129.06
CUDA Version: 11.4
python: 3.8.13
torch: 1.8.0+cu111
torchvision: 0.9.0+cu111
transformers: 4.2.1
</details>
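
After installation, a quick sanity check (ours, not part of the repository) can confirm that the environment matches the versions listed above:

```python
# Environment sanity check; expected versions are taken from the list above.
import sys
import torch
import torchvision
import transformers

print("python      :", sys.version.split()[0])    # expected 3.8.x
print("torch       :", torch.__version__)         # expected 1.8.0+cu111
print("torchvision :", torchvision.__version__)   # expected 0.9.0+cu111
print("transformers:", transformers.__version__)  # expected 4.2.1
print("CUDA available:", torch.cuda.is_available())
```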

2. Dataset Preparation

We recommend following the dataset downloading instructions of VL-Adapter.

The following is the file structure of the datasets for your convenience:

<details> <summary>Click for more details... </summary>
datasets/    <= for dataset downloading, please refer to VL-Adapter
    ├── COCO
    │   └── clip_features
    ├── GQA
    │   └── clip_features
    ├── lxmert
    ├── nlvr
    │   └── clip_features
    ├── paragraphs
    ├── VG
    │   └── clip_features
    ├── video
    │   ├── ann
    │   │   ├── how2qa
    │   │   ├── how2r
    │   │   ├── tvc
    │   │   ├── tvqa
    │   │   ├── tvr
    │   │   ├── yc2c
    │   │   └── yc2r
    │   └── vis_features
    │       ├── how2
    │       │   └── clip-vit
    │       ├── tv
    │       │   └── clip-vit
    │       └── yc2
    │           └── clip-vit
    └── vqa
</details>
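
Once the downloads finish, a small check like the following (our helper, assuming the tree above is placed under `datasets/`) can verify that the expected directories are in place:

```python
# Verify the dataset layout shown above; adjust datasets_root if your path differs.
from pathlib import Path

datasets_root = Path("datasets")
expected = [
    "COCO/clip_features",
    "GQA/clip_features",
    "nlvr/clip_features",
    "VG/clip_features",
    "lxmert",
    "paragraphs",
    "vqa",
    "video/ann",
    "video/vis_features",
]
for rel in expected:
    status = "ok" if (datasets_root / rel).is_dir() else "MISSING"
    print(f"{status:8s} {rel}")
```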

3. Training & Evaluation (VL-PET-large)

Taking VL-PET-large as an example, we can conduct training and evaluation on different tasks as follows:

Code Structure

The following is the file structure of the VL-PET project for your convenience:

<details> <summary>Click for more details... </summary>
./datasets/  <= the details are listed in the section of Dataset Preparation
    ├──...
    └──...

./VL-PET/
    ├── src/    <= store code implementation for VL-PET and state-of-the-art baselines based on BART-base and T5-base
    └── scripts
        ├── image-text    <= store scripts for running on image-text tasks
        └── video-text    <= store scripts for running on video-text tasks
</details>

Running Command

For other experiments, we can replace VL-PET-large in the .sh file name with VL-PET-middleX, VL-PET-middleY, VL-PET-small, full_finetuning, bitfit and so on. The details of the hyper-parameters are reported in the appendix of our paper.
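
To see exactly which variants have a ready-made script before editing the command, you can list the .sh files (a small helper of ours, assuming the scripts/ layout shown in Code Structure and that it is run from the VL-PET/ directory):

```python
# List available run scripts; assumes the scripts/ layout from Code Structure.
from pathlib import Path

for task in ("image-text", "video-text"):
    names = sorted(p.stem for p in Path("scripts", task).glob("*.sh"))
    print(f"{task}: {', '.join(names)}")
```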

1. VL-PET-large

Please refer to Quick Start.

2. VL-PET-middleX

<details> <summary>Click for more details... </summary>
# VL-PET-middleX on image-text tasks (BART-base)
bash scripts/image-text/VL-PET-middleX.sh 20000 96 4 96 1e-3 42

# VL-PET-middleX on image-text tasks (T5-base)
bash scripts/image-text/T5-VL-PET-middleX.sh 20001 192 4 0.3 96 3e-4 42

# VL-PET-middleX on video-text tasks (BART-base)
bash scripts/video-text/VL-PET-middleX.sh 20002 96 4 96 7e-4 20 42
</details>

3. VL-PET-middleY

<details> <summary>Click for more details... </summary>
# VL-PET-middleY on image-text tasks (BART-base)
bash scripts/image-text/VL-PET-middleY.sh 20000 96 4 96 1e-3 42

# VL-PET-middleY on image-text tasks (T5-base)
bash scripts/image-text/T5-VL-PET-middleY.sh 20001 192 4 0.3 96 3e-4 42

# VL-PET-middleY on video-text tasks (BART-base)
bash scripts/video-text/VL-PET-middleY.sh 20002 96 4 96 7e-4 20 42
</details>

4. VL-PET-small

<details> <summary>Click for more details... </summary>
# VL-PET-small on image-text tasks (BART-base)
bash scripts/image-text/VL-PET-small.sh 20000 96 4 96 1e-3 42

# VL-PET-small on image-text tasks (T5-base)
bash scripts/image-text/T5-VL-PET-small.sh 20001 192 4 0.3 96 3e-4 42

# VL-PET-small on video-text tasks (BART-base)
bash scripts/video-text/VL-PET-small.sh 20002 96 4 96 7e-4 20 42
</details>

5. Baselines

<details> <summary>Click for more details... </summary>

For baselines (e.g., full fine-tuning, VL-Adapter and Compacter), please refer to VL-Adapter and Ladder-Side-Tuning.

</details>

Checkpoints & Logs

We provide checkpoints & logs for BART-base on image-text tasks as follows:

| Method | Params (%) | VQA (%) | GQA (%) | NLVR$^2$ (%) | COCO (CIDEr) | Avg. | Checkpoints & Logs |
| --- | --- | --- | --- | --- | --- | --- | --- |
| VL-PET-small | 2.98 | 65.36 | 54.08 | 72.50 | 121.07 | 78.25 | Link |
| VL-PET-middleX | 2.98 | 65.45 | 54.37 | 72.86 | 121.09 | 78.44 | Link |
| VL-PET-middleY | 2.98 | 65.53 | 54.08 | 73.92 | 120.20 | 78.43 | Link |
| VL-PET-large | 4.16 | 66.40 | 54.94 | 73.36 | 122.11 | 79.20 | Link |
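
The Avg. column appears to be the unweighted mean of the four task scores (VQA, GQA, NLVR$^2$ and COCO CIDEr); for example, for VL-PET-large: (66.40 + 54.94 + 73.36 + 122.11) / 4 ≈ 79.20. A quick recomputation over the table:

```python
# Recompute the Avg. column of the table above as the mean of the four scores.
rows = {
    "VL-PET-small":   (65.36, 54.08, 72.50, 121.07),
    "VL-PET-middleX": (65.45, 54.37, 72.86, 121.09),
    "VL-PET-middleY": (65.53, 54.08, 73.92, 120.20),
    "VL-PET-large":   (66.40, 54.94, 73.36, 122.11),
}
for name, scores in rows.items():
    print(f"{name:16s} avg = {sum(scores) / len(scores):.2f}")  # matches Avg. column
```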

Acknowledgements

This work benefits from VL-Adapter, Ladder-Side-Tuning and unify-parameter-efficient-tuning. Our logo is borrowed from OpenMoji. Thanks for their awesome work!

Reference

If you find VL-PET useful for your research, please consider giving this repository a star and citing our paper as follows:

@inproceedings{hu2023vlpet,
  title     = {VL-PET: Vision-and-Language Parameter-Efficient Tuning via Granularity Control},
  author    = {Zi-Yuan Hu and Yanyang Li and Michael R. Lyu and Liwei Wang},
  booktitle = {ICCV},
  year      = {2023}
}