[CVPR 2024] Iterated Learning Improves Compositionality in Large Vision-Language Models
This is the implementation of the paper Iterated Learning Improves Compositionality in Large Vision-Language Models.
We design an iterated learning algorithm that improves compositionality in large vision-language models, inspired by cultural transmission theory in cognitive science.
:wrench: Installation
Please run the following commands to initiate a fresh conda environment and install the required packages.
conda create -n clipenv python=3.8
conda activate clipenv
pip install -r requirements.txt
To evaluate the model on compositionality benchmarks such as CREPE and SugarCREPE, you need to download their required data. Check out the corresponding repositories for details.
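As an example, a minimal sketch for fetching the SugarCREPE hard-negative annotations (the local data/sugar-crepe/ target is illustrative, and the in-repo data/ location follows SugarCREPE's README at the time of writing; point your evaluation config at wherever you place the files):
git clone https://github.com/RAIVNLab/sugar-crepe.git
mkdir -p data/sugar-crepe
# copy the hard-negative annotation JSONs from the SugarCREPE repository
cp sugar-crepe/data/*.json data/sugar-crepe/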
:keyboard: Usage
Evaluate a pretrained model
All the testing scripts are integrated in test.sh. To evaluate a model, simply run:
bash test.sh <model-type> <checkpoint-path> <task>
<model-type> can be fdt if you are evaluating codebook variants of the CLIP model (like the model we use), or clip if evaluating the CLIP baseline.
<checkpoint-path> is the folder that contains the model checkpoints.
<task> can be one of compositionality, retrieval, recognition, probing.
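For example, to run the compositionality evaluation on an iterated-learning checkpoint (the checkpoint folder name below is illustrative):
bash test.sh fdt checkpoints/cc3m_IL_6000 compositionality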
Note that we use clip-benchmark for evaluating recognition and retrieval; it will automatically download the required datasets into the data folder.
The pretrained model checkpoints can be found here.
Training a CLIP model using iterated learning
First, to prepare the data for training, we recommend using publicly available image-text datasets such as Conceptual Captions (CC3M), Conceptual 12M (CC12M), and LAION115M. img2dataset is a very convenient tool for downloading these large-scale datasets.
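As a rough sketch, downloading CC3M with img2dataset could look like the following (the TSV filename, column names, and output folder are placeholders based on img2dataset's CC3M example; adjust the output format to whatever the training config expects):
img2dataset --url_list cc3m.tsv --input_format "tsv" \
--url_col "url" --caption_col "caption" \
--output_format webdataset --output_folder data/cc3m \
--processes_count 16 --thread_count 64 --image_size 256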
After preparing the data (CC3M in this example), to train a ViT-B/32 CLIP model using our iterated learning algorithm, please run
bash run.sh example/clip_fdt/train_solver.py \
--config example/clip_fdt/config_cc3m.yaml \
--output_path output \
--batch_size 256 \
--exp_name cc3m_IL_6000
This script assumes 4 GPUs on 1 node. You can modify the number of GPUs and nodes in run.sh.
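As a rough illustration of what such a launch involves, an equivalent single-node, 4-GPU run with torchrun would look roughly like this (the actual launcher and flags used by run.sh may differ):
torchrun --nnodes=1 --nproc_per_node=4 example/clip_fdt/train_solver.py \
--config example/clip_fdt/config_cc3m.yaml \
--output_path output \
--batch_size 256 \
--exp_name cc3m_IL_6000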
Training a standard CLIP model as a baseline
To train a baseline CLIP model (also ViT-B/32), please run
bash run.sh example/clip/train_solver.py \
--config example/clip/config_cc3m.yaml \
--output_path output \
--batch_size 256 \
--exp_name baseline_clip
:paperclip: Cite
If you find this repository useful, please consider citing:
@inproceedings{zheng2024iterated,
title={Iterated learning improves compositionality in large vision-language models},
author={Zheng, Chenhao and Zhang, Jieyu and Kembhavi, Aniruddha and Krishna, Ranjay},
booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
pages={13785--13795},
year={2024}
}
:open_book: Acknowledgements
Part of our code is adapted from the following repositories and sources. We thank the authors for making their code available.