# Generalized SAM: Efficient Fine-Tuning of SAM for Variable Input Image Sizes (ECCV2024 Workshop)
This repo is the official implementation of Generalized SAM, accepted by the ECCV 2024 Workshop on Computational Aspects of Deep Learning (CADL).
## Highlights
<div align="center">
<img src="figs/img1.png" width="60%">
<img src="figs/img2.png" width="30%">
</div>

- **Training with random cropping**: Our Generalized SAM (GSAM) can cope with variable input image sizes, allowing random cropping to be used for the first time during SAM fine-tuning (see the sketch after this list).
- **Multi-scale AdaptFormer**: GSAM can use multi-scale features during SAM fine-tuning.
- **Low training cost**: Compared to conventional SAM fine-tuning methods, GSAM significantly reduces the computational cost of training and GPU memory usage.
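As a rough illustration of the random-cropping idea, here is a minimal sketch in plain PyTorch; `random_crop_pair` and all names are illustrative, not this repo's API:

```python
# Minimal sketch of paired random cropping for segmentation fine-tuning.
# Assumption: image is a (C, H, W) tensor and mask is a (H, W) tensor.
import torch
import torchvision.transforms.functional as TF

def random_crop_pair(image, mask, size=256):
    """Crop the same random window from an image and its label mask."""
    top = torch.randint(0, image.shape[-2] - size + 1, (1,)).item()
    left = torch.randint(0, image.shape[-1] - size + 1, (1,)).item()
    return (TF.crop(image, top, left, size, size),
            TF.crop(mask, top, left, size, size))
```

Because GSAM accepts variable input sizes, the crop size here can be chosen freely (see the MACs comparison in Results).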
## Installation
Following Segment Anything, `python=3.8.16`, `pytorch=1.8.0`, and `torchvision=0.9.0` are used in GSAM.
- Clone this repository.
```
git clone https://github.com/usagisukisuki/G-SAM.git
cd G-SAM
```
- Install PyTorch and TorchVision (you can follow the instructions [here](https://pytorch.org/get-started/locally/)).
- Install other dependencies.
```
pip install -r requirements.txt
```
## Checkpoints
We use the ViT-B checkpoint of SAM. Please download `sam_vit_b_01ec64.pth` from [Segment Anything](https://github.com/facebookresearch/segment-anything) and place it under `models/Pretrained_model`:
```
models
├── Pretrained_model
    ├── sam_vit_b_01ec64.pth
```
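To quickly confirm the download, the checkpoint can be loaded as a plain state dict (a small sketch; the path follows the tree above):

```python
import torch

# Load the SAM ViT-B checkpoint as a state dict and inspect it.
state = torch.load("models/Pretrained_model/sam_vit_b_01ec64.pth", map_location="cpu")
print(f"{len(state)} tensors in the checkpoint")
```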
## Dataset
Step 1: Please download the datasets from [CamVid], [M-Building], [ISBI2012], [Kvasir-SEG], [Synapse], [Cityscapes], and [Trans10k].
Step 2: Please extract them under `Dataset` and make the layout look like this:
```
Dataset
├── CamVid
│   ├── train
│   ├── trainannot
│   ├── ...
├── M-building
│   ├── png
│   │   ├── train
│   │   ├── train_labels
│   │   ├── ...
│   ├── tiff
├── ISBI2012
│   ├── Image
│   ├── Label
├── Kvasir
│   ├── datamodel
│   ├── ...
├── Synapse
│   ├── datamodel
│   ├── ...
├── Cityscapes
│   ├── gtFine
│   ├── leftImg8bit
├── Trans10k
    ├── train
    ├── test
    ├── val
```
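An optional sanity check for the layout above (a throwaway sketch; folder names are taken from the tree):

```python
from pathlib import Path

# Verify that the top-level dataset folders from the tree above exist.
expected = ["CamVid", "M-building", "ISBI2012", "Kvasir",
            "Synapse", "Cityscapes", "Trans10k"]
missing = [name for name in expected if not (Path("Dataset") / name).is_dir()]
print("missing folders:", missing or "none")
```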
## Fine-tuning on SAM
### Binary segmentation
Once a binary segmentation dataset (e.g. ISBI2012) is prepared, we can train the model on a single GPU:
```
python3 train.py --gpu 0 --dataset 'ISBI2012' --out result_sam --modelname 'SAM' --batchsize 8
```
To use multiple GPUs, we can run:
```
CUDA_VISIBLE_DEVICES=0,1 python3 train.py --dataset 'ISBI2012' --out result_sam --modelname 'SAM' --batchsize 8 --multi
```
### Multi-class segmentation
Once a multi-class segmentation dataset (e.g. Cityscapes) is prepared, we can train the model on a single GPU:
```
python3 train.py --gpu 0 --dataset 'Cityscapes' --out result_sam --modelname 'SAM' --batchsize 8 --num_classes=19 --multimask_output=True
```
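With `--multimask_output=True` and `--num_classes=19`, the model emits one logit map per class; per-pixel predictions are then typically obtained with an argmax over the class dimension (a generic sketch, not this repo's exact decoding code):

```python
import torch

logits = torch.randn(1, 19, 512, 512)  # (batch, num_classes, H, W), dummy values
pred = logits.argmax(dim=1)            # per-pixel class indices in [0, 18]
```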
## Fine-tuning on Generalized SAM
To train our GSAM, please run:
```
python3 train.py --gpu 0 --dataset 'ISBI2012' --modelname 'GSAM'
```
## Fine-tuning on SAM with Anything
We can also try various other adaptation methods. Please run one of the following to train the corresponding variant:
```
python3 train.py --gpu 0 --dataset 'ISBI2012' --modelname 'SAM_LoRA'
python3 train.py --gpu 0 --dataset 'ISBI2012' --modelname 'SAM_ConvLoRA'
python3 train.py --gpu 0 --dataset 'ISBI2012' --modelname 'SAM_AdaptFormer'
python3 train.py --gpu 0 --dataset 'ISBI2012' --modelname 'SAMUS'
```
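For intuition, LoRA-style adaptation freezes the pretrained weights and adds a trainable low-rank update on top of them; a generic sketch (not this repo's implementation):

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Wrap a frozen nn.Linear with a trainable low-rank residual (generic LoRA)."""
    def __init__(self, base: nn.Linear, rank: int = 4, alpha: float = 1.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():  # freeze the pretrained weights
            p.requires_grad = False
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))  # zero init: no change at start
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + (x @ self.A.T @ self.B.T) * self.scale
```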
## Results
We assessed image data from various domains with varying input image sizes (in-vehicle, satellite, microscopy, medical, and transparent-object images). The scores below are mIoU.
| Method | CamVid | M-Building | ISBI | Kvasir-SEG | Synapse | Cityscapes | Trans10k |
|---|---|---|---|---|---|---|---|
| SAM | 58.27 | 67.59 | 72.15 | 75.94 | 40.61 | 57.15 | 83.37 |
| LoRA | 65.20 | 76.76 | 79.18 | 82.20 | 39.08 | 59.09 | 85.71 |
| ConvLoRA | 66.96 | 77.32 | 79.87 | 85.20 | 43.41 | 62.43 | 86.47 |
| AdaptFormer | 74.80 | 80.46 | 80.46 | 88.53 | 61.28 | 75.49 | 89.91 |
| SAMUS | 48.42 | 49.87 | 78.64 | 88.28 | 20.66 | 48.61 | 87.18 |
| GSAM | 67.21 | 80.69 | 80.53 | 87.83 | 72.78 | 74.10 | 87.08 |
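For reference, mIoU averages the per-class intersection over union across classes that actually occur (a generic sketch; the repo's evaluation code may differ):

```python
import torch

def mean_iou(pred, target, num_classes):
    """mIoU over classes that appear in the prediction or target."""
    ious = []
    for c in range(num_classes):
        inter = ((pred == c) & (target == c)).sum().item()
        union = ((pred == c) | (target == c)).sum().item()
        if union > 0:
            ious.append(inter / union)
    return sum(ious) / max(len(ious), 1)
```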
We also compared MACs and segmentation accuracy on ISBI2012:
| Method | MACs (G) | mIoU |
|---|---|---|
| SAM | 371.98 | 72.15 |
| LoRA | 371.98 | 79.18 |
| ConvLoRA | 511.45 | 79.87 |
| AdaptFormer | 386.48 | 80.46 |
| SAMUS | 145.87 | 78.64 |
| GSAM (random crop 256×256) | 270.33 | 80.63 |
| GSAM (random crop 128×128) | 74.07 | 80.53 |
| GSAM (random crop 64×64) | 18.53 | 78.53 |
| GSAM (random crop 32×32) | 7.42 | 71.45 |
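Note how MACs shrink sharply as the crop size decreases while mIoU stays high down to 128×128 crops. If you want to reproduce such measurements, a complexity counter such as `ptflops` can be used (a sketch assuming `ptflops` is installed; the ResNet-18 stand-in should be replaced by the model you actually build):

```python
import torchvision.models as models
from ptflops import get_model_complexity_info  # pip install ptflops

model = models.resnet18()  # stand-in model; substitute your SAM/GSAM instance
macs, params = get_model_complexity_info(
    model, (3, 128, 128), as_strings=True, print_per_layer_stat=False)
print(f"MACs: {macs}, params: {params}")
```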
## Citation
```bibtex
@article{kato2024generalized,
  title={Generalized SAM: Efficient Fine-Tuning of SAM for Variable Input Image Sizes},
  author={Kato, Sota and Mitsuoka, Hinako and Hotta, Kazuhiro},
  journal={arXiv preprint arXiv:2408.12406},
  year={2024}
}
```