FitCLIP
This repo contains the code for the BMVC 2022 paper FitCLIP: Refining Large-Scale Pretrained Image-Text Models for Zero-Shot Video Understanding Tasks.
Setup
With Conda installed, run:
conda env create
conda activate sm
Download the datasets and models
To use many of the datasets we used here, you need to download them; see each dataset's official website for how to do so. Check out the config files under config/data to find out what paths you need to set up for them.
Similarly, to use many of the pre-trained models, you may need to download them and place them under a specific path (or change the path). Check out the configs under config/encoder. You may also need to preprocess them so that only the state dict is kept (as opposed to the whole checkpoint, which includes, for example, the optimizer state). Check out the scripts under scripts/ to preprocess them.
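For reference, extracting just the state dict from a checkpoint can look roughly like the following minimal sketch (the file names and the exact checkpoint layout are assumptions; see the scripts under scripts/ for the actual procedure):

import torch

# Hypothetical paths, for illustration only.
checkpoint = torch.load("checkpoint.ckpt", map_location="cpu")
# Lightning-style checkpoints usually nest the weights under "state_dict";
# fall back to the loaded object itself if it is already a plain state dict.
state_dict = checkpoint.get("state_dict", checkpoint)
torch.save(state_dict, "model_state_dict.pt")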
Run the evaluation
Run like:
python -m aligner command=evaluate encoder=$MODEL data=$DATASET
Check out the options with --help and the available configs under config/. Some example runs follow.
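For example, to list the available options:

python -m aligner --help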
Evaluate our main model
We provide a student model created with our method. To use it correctly, step 2 of our method needs to be applied (weight-space ensembling between the student checkpoint and the original CLIP checkpoint). The following commands apply step 2 on the fly and evaluate on multiple benchmarks:
student=https://github.com/bryant1410/fitclip/releases/download/publish/distill_clip_webvid_4_5k_webvid_fit_64_64_fix_temp_lab_loss_09999_best_lab_val_loss_only_student.pt
python -m aligner \
--multirun \
command=evaluate \
encoder=wise \
+encoder@encoder.model1=clip_vit_b_16 \
+encoder@encoder.model2=clip_from_pretrained \
+encoder.model2.model.name="$student" \
data=didemo,moments_in_time,msrvtt,ucf101,webvid,youcook2 \
silent=true
The checkpoint will be downloaded automatically the first time it is used and cached for future use. If this doesn't work, you can still download it yourself and pass a local path instead.
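Conceptually, the weight-space ensembling ("wise") step interpolates the two checkpoints' parameters. Here is a minimal sketch of the idea, not the repo's exact implementation; the file names and the mixing coefficient are assumptions:

import torch

ALPHA = 0.5  # assumed mixing coefficient between the two models

# Hypothetical local paths to the two state dicts.
clip_sd = torch.load("clip_vit_b_16.pt", map_location="cpu")
student_sd = torch.load("student.pt", map_location="cpu")

# Interpolate every shared parameter in weight space.
wise_sd = {k: (1 - ALPHA) * clip_sd[k] + ALPHA * student_sd[k] for k in clip_sd}

torch.save(wise_sd, "wise_model.pt")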
CLIP on WebVid val
Run like:
python -m aligner command=evaluate encoder=clip_vit_b_16 data=webvid
Frozen in Time on WebVid val
python -m aligner command=evaluate encoder=frozen_in_time data=webvid
Evaluate on multiple benchmarks at the same time
python -m aligner --multirun command=evaluate encoder=clip_vit_b_16 \
data=didemo,moments_in_time,msrvtt,ucf101,webvid,youcook2
Evaluate a custom checkpoint
Suppose the checkpoint path is a.pt. Then run:
python -m aligner \
--multirun \
command=evaluate \
encoder=clip_from_pretrained \
+encoder.model.name=$PWD/a.pt \
data=moments_in_time,msrvtt,webvid,youcook2 \
silent=true
Save a model's predictions
python -m aligner command=predict
The predictions will be saved to predictions.pt. You can see the options with --help and change the config file accordingly.
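The output can then be loaded back with PyTorch, e.g. (a minimal sketch; the exact structure of the saved object depends on the config used):

import torch

predictions = torch.load("predictions.pt", map_location="cpu")
print(type(predictions))  # inspect the saved structure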
Train a model (reproduce the paper results)
Run:
python -m aligner \
--config-name teacher_student_train.yaml \
command=train \
+encoder@encoder.student=clip_vit_b_16 \
+encoder@encoder.teacher=clip_vit_b_16 \
data=mixed_batch_webvid_4_5k_all \
++model.fit_temperature=false \
++trainer.val_check_interval=30 \
++trainer.callbacks.3.train_time_interval.hours=0 \
++trainer.callbacks.3.train_time_interval.seconds=30 \
++trainer.callbacks.3.save_top_k=-1 \
++model.labeled_dataset_loss_share=0.9999
Then grab the latest checkpoint generated (ckpt=outputs/${DATE_AND_TIME}/checkpoints/best_labeled.ckpt) and extract the student model:
./scripts/checkpoint_to_state_dict.py "$ckpt" > $student
Then you can evaluate it:
python -m aligner \
--multirun \
command=evaluate \
encoder=wise \
+encoder@encoder.model1=clip_vit_b_16 \
+encoder@encoder.model2=clip_from_pretrained \
+encoder.model2.model.name="$student" \
data=didemo,moments_in_time,msrvtt,ucf101,webvid,youcook2 \
silent=true
Citation
If you use this code, please cite:
@inproceedings{Castro_2022_BMVC,
author = {Santiago Castro and Fabian Caba},
title = {{FitCLIP}: Refining Large-Scale Pretrained Image-Text Models for Zero-Shot Video Understanding Tasks},
booktitle = {33rd British Machine Vision Conference 2022, {BMVC} 2022, London, UK, November 21-24, 2022},
publisher = {{BMVA} Press},
year = {2022},
url = {https://bmvc2022.mpi-inf.mpg.de/0939.pdf}
}
Troubleshooting
Hydra shell completion doesn't work
See https://github.com/facebookresearch/hydra/issues/1957
UCF101 website SSL certificate is not recognized
The problem is that the server's certificate chain is incomplete. The intermediate CA cert can be manually added:
sudo sh -c "curl https://www.incommon.org/custom/certificates/repository/sha384%20Intermediate%20cert.txt \
> /usr/local/share/ca-certificates/incommon.crt"
sudo update-ca-certificates
# The requests library CA cert list also needs to be updated. Run like:
curl https://www.incommon.org/custom/certificates/repository/sha384%20Intermediate%20cert.txt \
>> $CONDA_PREFIX/lib/python3.8/site-packages/certifi/cacert.pem
Protobuf version error
If you have an error like:
This program was compiled against version 3.9.2 of the Protocol Buffer runtime library, which is not compatible with the installed version (3.19.4).
Do:
conda install protobuf=3.9.2