Kandinsky calibration for image segmentation models
Example usage
We will train, calibrate and evaluate a model on the MS-COCO dataset, restricting attention to the "person" class.
Preparation
Start by installing the requirements (preferably in a conda environment):
python -m pip install -r requirements.txt
Now we need to download MS-COCO and prepare the relevant subset. We use fiftyone for this. Configure the target directories in src/utils/coco-prepare.py and run the file. The download and export will take a while (roughly 1.5 hours). The test set of MS-COCO is not labeled, so we repurpose the validation set for testing. We then split the original train set into train, validation, and calibration data.
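For orientation, here is a minimal sketch of what the download and export step looks like with fiftyone. The actual logic, including the configured target directories, lives in src/utils/coco-prepare.py; the paths below are placeholders.

```python
import fiftyone as fo
import fiftyone.zoo as foz

# Download the COCO-2017 splits with segmentation labels for the "person" class
for split in ("train", "validation"):
    dataset = foz.load_zoo_dataset(
        "coco-2017",
        split=split,
        label_types=["segmentations"],
        classes=["person"],
    )
    # Export the subset to disk in COCO format (placeholder directory)
    dataset.export(
        export_dir=f"/path/to/coco-person/{split}",
        dataset_type=fo.types.COCODetectionDataset,
    )
```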
Our goal is to calibrate a miscalibrated model. Obtaining such a miscalibrated model is easier if we train on a small number of images. We also do not need many validation images, since optimizing training is not relevant here. We therefore use most of COCO's original train split as calibration data (and we can later choose to reduce the number of calibration images to evaluate different calibration methods in a low-data scenario). To make the train/validation/calibration manifest files, run src/utils/coco-split.py.
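As a rough illustration of this split (the actual implementation is in src/utils/coco-split.py; the counts, paths, and manifest format here are placeholder assumptions):

```python
import json
import random

# Load the exported COCO annotations (placeholder path)
with open("/path/to/coco-person/train/labels.json") as f:
    coco = json.load(f)

image_ids = [img["id"] for img in coco["images"]]
random.seed(0)
random.shuffle(image_ids)

# Small train/val splits; everything else becomes calibration data
splits = {
    "train": image_ids[:1000],
    "val": image_ids[1000:2000],
    "calibration": image_ids[2000:],
}
for name, ids in splits.items():
    with open(f"/path/to/manifests/{name}.json", "w") as f:
        json.dump(ids, f)
```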
The notebook notebooks/check-coco.ipynb can be used to verify that the training and validation splits were properly created and that they can be located by the CocoDataset.
Training
Set the environment variables DATA_TRAINVAL_ROOT and DATA_TEST_ROOT. A sample configuration for training a UNet on 1000 images (this split should have been created in the preparation step) is provided in configs/experiment/train_coco-person_t1000.yaml. Run
python src/train.py experiment=train_coco-person_t1000
until you get sufficiently many model checkpoints to choose from. Models with a larger generalization gap (val_loss - train_loss) are usually less well calibrated,<sup>1</sup> and are therefore good testing grounds for the Kandinsky method. Furthermore, the validation Dice score usually keeps rising even after the validation loss has hit a minimum, which is another reason not to stop training early. In the default setup, 250 epochs of training should suffice. The validation Dice and loss curves will look approximately like this:
<img src="imgs/val_dice.png" width="300px"/> <img src="imgs/val_loss.png" width="300px"/>
You can check the (mis)calibration properties of the model checkpoints by running
python src/eval.py ckpt_path=logs/train/runs/train_coco-person_t1000/<rest of checkpoint path>
and inspecting the resulting images in logs/eval/...
Here we show an example of the Expected Calibration Error (ECE) plotted for each individual pixel:
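For reference, a per-pixel ECE map can be sketched as follows. This is a minimal illustration for binary foreground probabilities; the function and variable names are assumptions, not the repo's actual evaluation code.

```python
import torch

def pixelwise_ece(probs: torch.Tensor, labels: torch.Tensor, n_bins: int = 10) -> torch.Tensor:
    """probs, labels: [N, H, W] foreground probabilities and binary ground truth.
    Returns an [H, W] map of ECE values, computed per pixel across the N images."""
    confidences = torch.where(probs >= 0.5, probs, 1 - probs)  # confidence of the predicted class
    accuracies = ((probs >= 0.5).long() == labels).float()

    bin_edges = torch.linspace(0, 1, n_bins + 1)
    ece = torch.zeros(probs.shape[1:])
    for lo, hi in zip(bin_edges[:-1], bin_edges[1:]):
        in_bin = ((confidences > lo) & (confidences <= hi)).float()  # [N, H, W]
        count = in_bin.sum(dim=0).clamp(min=1)
        acc = (accuracies * in_bin).sum(dim=0) / count
        conf = (confidences * in_bin).sum(dim=0) / count
        ece += in_bin.mean(dim=0) * (acc - conf).abs()  # bin weight * |accuracy - confidence|
    return ece
```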
Calibrating
Calibration amounts to evaluating the trained model on a calibration set and storing the resulting non-conformity scores per pixel. These can subsequently be used for computing non-conformity curves (a list of 100 equally spaced quantiles of the nc-scores), where the chosen calibration method (i.e. pixel/image/kandinsky) determines how we go from the scores to the curves.
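In pseudocode, the pixelwise variant looks roughly like this. It is a sketch for the binary case with one common choice of non-conformity score (1 minus the probability of the true class), which is an assumption for illustration; the repo's curves additionally carry a class dimension, giving [100, C, H, W].

```python
import torch

# Assumed inputs (placeholders): foreground probabilities and binary labels
# for the N calibration images, both of shape [N, H, W]
probs = torch.rand(500, 128, 128)
labels = (torch.rand(500, 128, 128) > 0.5).long()

# Non-conformity score: 1 minus the probability assigned to the true class
scores = torch.where(labels == 1, 1 - probs, probs)

# Non-conformity curves: 100 equally spaced quantiles of the scores, per pixel
levels = torch.linspace(0, 1, 100)
nc_curves = torch.quantile(scores, levels, dim=0)  # [100, H, W]
```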
Running calibration is as simple as
# pixelwise
python src/calibrate.py experiment=cal_coco-person_t1000_c2000_pixel ckpt_path=logs/train/runs/train_coco-person_t1000/...
# imagewise
python src/calibrate.py experiment=cal_coco-person_t1000_c2000_image ckpt_path=logs/train/runs/train_coco-person_t1000/...
# kandinsky (k-means clustering)
python src/calibrate.py experiment=cal_coco-person_t1000_c2000_kmeans ckpt_path=logs/train/runs/train_coco-person_t1000/...
# kandinsky (genetic annuli optimization)
python src/calibrate.py experiment=cal_coco-person_t1000_c2000_gen-ann ckpt_path=logs/train/runs/train_coco-person_t1000/...
In this example we use 2000 images for calibration. The calibrated checkpoint is stored in a subfolder of logs/calibrate/runs/cal_coco-person_t1000_c2000_<method> as cmodel.ckpt. This checkpoint has a field nc_curves that contains the non-conformity curves for every pixel in the form of a [100, C, H, W] tensor. Depending on the chosen calibration method, these curves may differ per pixel (pixelwise), be identical for every pixel (imagewise), or differ per pixel grouping (kandinsky). The upshot is that this tensor can be used to create each pixel's prediction set at the desired confidence level.
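For example, the prediction set for a single test image could be assembled along these lines (an illustrative sketch; the function name and the 1 - p score are assumptions consistent with the sketch above, not the repo's API):

```python
import torch

def prediction_sets(test_probs: torch.Tensor, nc_curves: torch.Tensor, alpha: float = 0.1) -> torch.Tensor:
    """test_probs: [C, H, W] class probabilities for one test image.
    nc_curves: [100, C, H, W] per-pixel quantile curves from the calibrated checkpoint.
    Returns a boolean [C, H, W] tensor marking which classes enter each pixel's set."""
    # Per-pixel threshold: the (1 - alpha) quantile of the calibration scores
    q_index = int(round((1 - alpha) * (nc_curves.shape[0] - 1)))
    threshold = nc_curves[q_index]  # [C, H, W]
    # Include a class whenever its non-conformity score stays below the threshold
    scores = 1 - test_probs
    return scores <= threshold
```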
Evaluating
Models can be tested using the same evaluation script mentioned earlier:
python src/eval.py ckpt_path=logs/calibrate/runs/<experiment name>/cmodel.ckpt experiment=eval_coco-person_t1000_c2000_pixel
Calibrated models contain the field nc_curves; if this field is found, the evaluation procedure additionally outputs plots and a tensor containing the per-pixel coverage error.
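As a rough sketch of how such a per-pixel coverage error could be computed (names and shapes are assumptions, not the repo's evaluation code):

```python
import torch

def coverage_error(pred_sets: torch.Tensor, labels: torch.Tensor, alpha: float = 0.1) -> torch.Tensor:
    """pred_sets: [N, C, H, W] boolean prediction-set membership for N test images.
    labels: [N, H, W] integer ground-truth class per pixel.
    Returns an [H, W] map of empirical coverage minus the nominal level 1 - alpha."""
    # Did the true class land inside the prediction set at each pixel?
    covered = pred_sets.float().gather(1, labels.unsqueeze(1)).squeeze(1)  # [N, H, W]
    return covered.mean(dim=0) - (1 - alpha)
```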
Citation
If you use Kandinsky calibration in your own research, please consider citing the companion paper.
@misc{kandinskycp2023,
title = {Kandinsky {{Conformal Prediction}}: {{Efficient Calibration}} of {{Image Segmentation Algorithms}}},
author = {Brunekreef, Joren and Marcus, Eric and Sheombarsing, Ray and Sonke, Jan-Jakob and Teuwen, Jonas},
year = {2023},
month = nov,
number = {arXiv:2311.11837},
eprint = {2311.11837},
primaryclass = {cs},
publisher = {{arXiv}},
doi = {10.48550/arXiv.2311.11837},
urldate = {2023-11-27},
abbr = {arXiv},
archiveprefix = {arxiv},
keywords = {Computer Science - Artificial Intelligence,Computer Science - Computer Vision and Pattern Recognition,Computer Science - Machine Learning}
}
Acknowledgments
Our codebase was built upon the Lightning Hydra template by Lukasz Zalewski.