🚀 CLIP-EBC

The official implementation of CLIP-EBC, proposed in the paper CLIP-EBC: CLIP Can Count Accurately through Enhanced Blockwise Classification.

At the release page, you can find the model weights. For the recently updated CLIP-EBC (ViT-B/16) model, we also provide the training logs (both plain-text and TensorBoard files).

Results on NWPU Test

| Methods | MAE | RMSE |
| --- | --- | --- |
| DMCount-EBC (based on VGG-19) | 83.7 | 376.0 |
| CLIP-EBC (based on ResNet50) | 75.8 | 367.3 |
| CLIP-EBC (based on ViT-B/16) | 61.2 | 278.3 |

Visualization

Citation

If you find this work useful, please consider citing it:

Usage

1. Preprocessing

1.0 Requirements

conda create -n clip_ebc python=3.12.4  # Create a new conda environment. You may use `mamba` instead of `conda` to speed up the installation.
conda activate clip_ebc  # Activate the environment.
pip install -r requirements.txt  # Install the required packages.
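
After installation, a quick sanity check (this assumes PyTorch with CUDA support is among the pinned requirements) confirms that the GPU is visible from the new environment:

python -c "import torch; print(torch.__version__, torch.cuda.is_available())"  # Expect `True` in the output if a CUDA device is available.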

1.1 Downloading the datasets

Download all datasets and unzip them into the data folder.

The data folder should look like:

data:
├─── ShanghaiTech
│   ├── part_A
│   │   ├── train_data
│   │   │   ├── images
│   │   │   └── ground-truth
│   │   └── test_data
│   │       ├── images
│   │       └── ground-truth
│   └── part_B
│       ├── train_data
│       │   ├── images
│       │   └── ground-truth
│       └── test_data
│           ├── images
│           └── ground-truth
├─── NWPU-Crowd
│   ├── images_part1
│   ├── images_part2
│   ├── images_part3
│   ├── images_part4
│   ├── images_part5
│   ├── mats
│   ├── train.txt
│   ├── val.txt
│   └── test.txt
└─── UCF-QNRF
    ├── Train
    └── Test

1.2 Running the preprocessing script

Then, run bash preprocess.sh to preprocess the datasets. In this script, do NOT modify the --dst_dir argument, as the pre-defined paths are used in other files.

2. Training

To train a model, use trainer.py. Below is the script that we used. You can modify the script to train on different datasets and models.

#!/bin/sh
export CUDA_VISIBLE_DEVICES=0  # Set the GPU ID. Comment this line to use all available GPUs.

### Some notes:
# 1. The training script will automatically use all available GPUs in the DDP mode.
# 2. You can use the `--amp` argument to enable automatic mixed precision training to speed up the training process. Could be useful for UCF-QNRF and NWPU.
# 3. Valid values for `--dataset` are `nwpu`, `sha`, `shb`, and `qnrf`.
# See the `trainer.py` for more details.

# Train the commonly used VGG19-based encoder-decoder model on NWPU-Crowd.
python trainer.py \
    --model vgg19_ae --input_size 448 --reduction 8 --truncation 4 --anchor_points average \
    --dataset nwpu \
    --count_loss dmcount &&

# Train the CLIP-EBC (ResNet50) model on ShanghaiTech A. Use `--dataset shb` if you want to train on ShanghaiTech B.
python trainer.py \
    --model clip_resnet50 --input_size 448 --reduction 8 --truncation 4 --anchor_points average --prompt_type word \
    --dataset sha \
    --count_loss dmcount &&

# Train the CLIP-EBC (ViT-B/16) model on UCF-QNRF, using VPT in training and sliding window prediction in testing.
# By default, 32 tokens for each layer are used in VPT. You can also set `--num_vpt` to change the number of tokens.
# By default, the deep visual prompt tuning is used. You can set `--shallow_vpt` to use the shallow visual prompt tuning.
python trainer.py \
    --model clip_vit_b_16 --input_size 224 --reduction 8 --truncation 4 \
    --dataset qnrf --batch_size 16 --amp \
    --num_crops 2 --sliding_window --window_size 224 --stride 224 --warmup_lr 1e-3 \
    --count_loss dmcount
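
Training progress can be monitored with TensorBoard, since TensorBoard event files are written alongside the text logs mentioned above. The log directory below is an assumption; point it at wherever your run actually writes its event files:

tensorboard --logdir ./checkpoints  # Assumed log location; adjust to your run's output directory.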

Some Tips

All available models

Arguments in trainer.py

Arguments for models
Arguments for CLIP-based models
Arguments for data
Arguments for evaluation

Note: When using sliding window prediction, if the image size is not a multiple of the window size, the last stride is reduced below the specified stride so that the final window still aligns with the image border and the whole image is covered.
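
As a concrete illustration of this note (a minimal sketch, not the code used in trainer.py or test_nwpu.py), the window start positions along one axis can be computed so that the final window always ends exactly at the image border:

def window_starts(length: int, window: int, stride: int) -> list[int]:
    """Start offsets of sliding windows along one axis; the last stride may shrink."""
    if length <= window:
        return [0]
    starts = list(range(0, length - window, stride))
    starts.append(length - window)  # align the final window with the image border
    return starts

print(window_starts(500, 224, 224))  # [0, 224, 276] -> the last stride is only 52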

Arguments for training

3. Testing on NWPU Test

To obtain results on the NWPU-Crowd test set, use test_nwpu.py instead.

# Test CNN-based models
python test_nwpu.py \
    --model vgg19_ae --input_size 448 --reduction 8 --truncation 4 --anchor_points average \
    --weight_path ./checkpoints/nwpu/vgg19_ae_448_8_4_fine_1.0_dmcount_aug/best_mae.pth \
    --device cuda:0 &&

# Test ViT-based models. Need to use the sliding window prediction method.
python test_nwpu.py \
    --model clip_vit_b_16 --input_size 224 --reduction 8 --truncation 4 --anchor_points average --prompt_type word \
    --num_vpt 32 --vpt_drop 0.0 --sliding_window --stride 224 \
    --weight_path ./checkpoints/nwpu/clip_vit_b_16_word_224_8_4_fine_1.0_dmcount/best_rmse.pth \
    --device cuda:0

4. Visualization

Use the model.ipynb notebook to visualize the model predictions.
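
If you prefer a standalone script over the notebook, a minimal sketch along the same lines is shown below. It assumes the model has already been built and its checkpoint loaded exactly as in model.ipynb, and that it returns a density map whose sum is the predicted count; the normalization constants here are the standard ImageNet values and may differ from the preprocessing actually used in the repository.

import torch
import matplotlib.pyplot as plt
from PIL import Image
from torchvision import transforms

@torch.no_grad()
def visualize(model: torch.nn.Module, image_path: str, device: str = "cuda:0") -> None:
    # Preprocess a single image (ImageNet statistics assumed; adjust if needed).
    transform = transforms.Compose([
        transforms.ToTensor(),
        transforms.Normalize(mean=(0.485, 0.456, 0.406), std=(0.229, 0.224, 0.225)),
    ])
    image = Image.open(image_path).convert("RGB")
    x = transform(image).unsqueeze(0).to(device)  # [1, 3, H, W]

    # Assumed output: a density map of shape [1, 1, H/reduction, W/reduction].
    density = model.to(device).eval()(x)
    count = density.sum().item()  # blockwise counts sum to the total prediction

    fig, axes = plt.subplots(1, 2, figsize=(10, 4))
    axes[0].imshow(image)
    axes[0].set_title("Input")
    axes[1].imshow(density.squeeze().cpu().numpy(), cmap="jet")
    axes[1].set_title(f"Predicted density (count ~ {count:.1f})")
    for ax in axes:
        ax.axis("off")
    plt.show()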