Mode Connectivity and Fast Geometric Ensembling

This repository contains a PyTorch implementation of the curve-finding and Fast Geometric Ensembling (FGE) procedures from the paper

Loss Surfaces, Mode Connectivity, and Fast Ensembling of DNNs

by Timur Garipov, Pavel Izmailov, Dmitrii Podoprikhin, Dmitry Vetrov and Andrew Gordon Wilson (NIPS 2018, Spotlight).

Introduction

Traditionally, the loss surfaces of deep neural networks are thought of as having multiple isolated local optima (see the left panel of the figure below). We show, however, that the optima are in fact connected by simple curves, such as a polygonal chain with only one bend, over which training and test accuracy are nearly constant (see the middle and right panels of the figure below), and we propose a method to find such curves. Inspired by this geometric observation, we propose Fast Geometric Ensembling (FGE), an ensembling method that aims to explore the loss surface along these curves of low loss. The method consists of running SGD with a cyclical learning rate schedule starting from a pre-trained solution, and averaging the predictions of the traversed networks. We show that FGE outperforms ensembling independently trained networks and the recently proposed Snapshot Ensembling for any given computational budget.

<p align="center"> <img src="https://user-images.githubusercontent.com/14368801/47261483-3857ce80-d49e-11e8-92c7-1ef44606f503.png" width=275> <img src="https://user-images.githubusercontent.com/14368801/47261482-37bf3800-d49e-11e8-8bbc-0e09619d86dd.png" width=275> <img src="https://user-images.githubusercontent.com/14368801/47261484-3857ce80-d49e-11e8-862d-e042ad8c5ba2.png" width=275> </p>
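The two curve families used in the paper can be sketched in a few lines. This is an illustration only (not the repo's code): weights are shown as flat lists of floats, whereas the implementation works with full network parameter vectors. `w1` and `w2` are the trained endpoints and `theta` is the single trainable bend.

```python
def bezier_point(w1, theta, w2, t):
    """Quadratic Bezier curve: phi(t) = (1-t)^2 w1 + 2t(1-t) theta + t^2 w2."""
    return [(1 - t) ** 2 * a + 2 * t * (1 - t) * b + t ** 2 * c
            for a, b, c in zip(w1, theta, w2)]


def polychain_point(w1, theta, w2, t):
    """Polygonal chain with one bend: w1 -> theta for t <= 0.5, theta -> w2 after."""
    if t <= 0.5:
        s = 2.0 * t
        return [(1 - s) * a + s * b for a, b in zip(w1, theta)]
    s = 2.0 * t - 1.0
    return [(1 - s) * b + s * c for b, c in zip(theta, w2)]
```

At t = 0 and t = 1 both curves recover the endpoints exactly; the bend controls everything in between.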

Please cite our work if you find it useful in your research:

@inproceedings{garipov2018loss,
  title={Loss Surfaces, Mode Connectivity, and Fast Ensembling of DNNs},
  author={Garipov, Timur and Izmailov, Pavel and Podoprikhin, Dmitrii and Vetrov, Dmitry P and Wilson, Andrew Gordon},
  booktitle={Advances in Neural Information Processing Systems},
  year={2018}
}

Dependencies

Usage

The code in this repository implements both the curve-finding procedure and Fast Geometric Ensembling (FGE), with examples on the CIFAR-10 and CIFAR-100 datasets.

Curve Finding

Training the endpoints

To run the curve-finding procedure, you first need to train the two networks that will serve as the endpoints of the curve. You can train the endpoints using the following command:

python3 train.py --dir=<DIR> \
                 --dataset=<DATASET> \
                 --data_path=<PATH> \
                 --transform=<TRANSFORM> \
                 --model=<MODEL> \
                 --epochs=<EPOCHS> \
                 --lr=<LR_INIT> \
                 --wd=<WD> \
                 [--use_test]

Parameters:

Use the --use_test flag if you want to evaluate performance on the test set instead of the validation set (formed from the last 5000 training objects).

For example, use the following commands to train VGG16, PreResNet or Wide ResNet:

#VGG16
python3 train.py --dir=<DIR> --dataset=[CIFAR10 or CIFAR100] --data_path=<PATH> --model=VGG16 --epochs=200 --lr=0.05 --wd=5e-4 --use_test --transform=VGG
#PreResNet
python3 train.py --dir=<DIR> --dataset=[CIFAR10 or CIFAR100] --data_path=<PATH>  --model=[PreResNet110 or PreResNet164] --epochs=150  --lr=0.1 --wd=3e-4 --use_test --transform=ResNet
#WideResNet28x10 
python3 train.py --dir=<DIR> --dataset=[CIFAR10 or CIFAR100] --data_path=<PATH> --model=WideResNet28x10 --epochs=200 --lr=0.1 --wd=5e-4 --use_test --transform=ResNet

Training the curves

Once you have two checkpoints to use as the endpoints, you can train the curve connecting them using the following command:

python3 train.py --dir=<DIR> \
                 --dataset=<DATASET> \
                 --data_path=<PATH> \
                 --transform=<TRANSFORM> \
                 --model=<MODEL> \
                 --epochs=<EPOCHS> \
                 --lr=<LR_INIT> \
                 --wd=<WD> \
                 --curve=<CURVE>[Bezier|PolyChain] \
                 --num_bends=<N_BENDS> \
                 --init_start=<CKPT1> \
                 --init_end=<CKPT2> \
                 [--fix_start] \
                 [--fix_end] \
                 [--use_test]

Parameters:

Use the flags --fix_end --fix_start if you want to fix the positions of the endpoints; otherwise the endpoints will be updated during training. See the section on training the endpoints for the description of the other parameters.
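Conceptually, curve training minimizes the expected loss of a point on the curve with t drawn uniformly from [0, 1] once per batch, updating only the bend while the fixed endpoints stay put. The following is a toy 1-D sketch of that objective under an assumed scalar loss, not the repo's training code:

```python
import random


def toy_loss(w):
    """Hypothetical scalar loss with its minimum at w = 1 (illustration only)."""
    return (w - 1.0) ** 2


def toy_grad(w):
    """dL/dw for the toy loss."""
    return 2.0 * (w - 1.0)


def train_bend(w1, w2, theta, lr=0.05, steps=2000, seed=0):
    """SGD on the bend theta of a quadratic Bezier curve between fixed w1, w2."""
    rng = random.Random(seed)
    for _ in range(steps):
        t = rng.random()                                   # one t per "batch"
        w = (1 - t) ** 2 * w1 + 2 * t * (1 - t) * theta + t ** 2 * w2
        theta -= lr * toy_grad(w) * 2 * t * (1 - t)        # chain rule: dw/dtheta = 2t(1-t)
    return theta
```

Starting from a poor bend (e.g. theta = 5 with endpoints 0 and 2), the expected loss along the curve drops substantially after training, which is the behavior the real procedure produces for network weights.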

For example, use the following commands to train VGG16, PreResNet or Wide ResNet:

#VGG16
python3 train.py --dir=<DIR> --dataset=[CIFAR10 or CIFAR100] --use_test --transform=VGG --data_path=<PATH> --model=VGG16 --curve=[Bezier|PolyChain] --num_bends=3  --init_start=<CKPT1> --init_end=<CKPT2> --fix_start --fix_end --epochs=600 --lr=0.015 --wd=5e-4

#PreResNet
python3 train.py --dir=<DIR> --dataset=[CIFAR10 or CIFAR100] --use_test --transform=ResNet --data_path=<PATH> --model=PreResNet164 --curve=[Bezier|PolyChain] --num_bends=3  --init_start=<CKPT1> --init_end=<CKPT2> --fix_start --fix_end --epochs=200 --lr=0.03 --wd=3e-4

#WideResNet28x10
python3 train.py --dir=<DIR> --dataset=[CIFAR10 or CIFAR100] --use_test --transform=ResNet --data_path=<PATH> --model=WideResNet28x10 --curve=[Bezier|PolyChain] --num_bends=3  --init_start=<CKPT1> --init_end=<CKPT2> --fix_start --fix_end --epochs=200 --lr=0.03 --wd=5e-4

Evaluating the curves

To evaluate the found curves, you can use the following command:

python3 eval_curve.py --dir=<DIR> \
                 --dataset=<DATASET> \
                 --data_path=<PATH> \
                 --transform=<TRANSFORM> \
                 --model=<MODEL> \
                 --wd=<WD> \
                 --curve=<CURVE>[Bezier|PolyChain] \
                 --num_bends=<N_BENDS> \
                 --ckpt=<CKPT> \
                 --num_points=<NUM_POINTS> \
                 [--use_test]

Parameters:

See the sections on training the endpoints and training the curves for the description of other parameters.

eval_curve.py outputs statistics on the train and test loss and error along the curve. It also saves a .npz file containing more detailed statistics in <DIR>.
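The role of --num_points can be sketched as follows (a toy scalar loss stands in for a real network; this is an illustration, not the script itself): the loss is evaluated at evenly spaced values of t along the curve and summarized with its extremes.

```python
def evaluate_curve(loss, w1, theta, w2, num_points=61):
    """Evaluate a scalar loss at num_points evenly spaced t in [0, 1] along a
    quadratic Bezier curve and return the (min, max) over the curve."""
    values = []
    for i in range(num_points):
        t = i / (num_points - 1)
        w = (1 - t) ** 2 * w1 + 2 * t * (1 - t) * theta + t ** 2 * w2
        values.append(loss(w))
    return min(values), max(values)
```

A finer grid (larger --num_points) gives a more faithful picture of the loss landscape along the curve at proportionally higher evaluation cost.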

CIFAR-100

In the table below we report the minimum and maximum train loss and test error (%) for the networks used as the endpoints and along the curves found by our method on CIFAR-100.

| DNN (Curve) | Min Train Loss | Max Train Loss | Min Test Error | Max Test Error |
| --- | --- | --- | --- | --- |
| VGG16 (Endpoints) | 0.89 | 0.89 | 27.5 | 27.5 |
| VGG16 (Bezier) | 0.48 | 0.89 | 27.4 | 30.1 |
| VGG16 (Poly) | 0.59 | 1.05 | 27.1 | 30.8 |
| PreResNet164 (Endpoints) | 0.49 | 0.49 | 21.6 | 21.7 |
| PreResNet164 (Bezier) | 0.26 | 0.49 | 21.3 | 23.4 |
| PreResNet164 (Poly) | 0.30 | 0.49 | 21.4 | 23.6 |
| WideResNet28x10 (Endpoints) | 0.20 | 0.21 | 18.6 | 18.9 |
| WideResNet28x10 (Bezier) | 0.11 | 0.21 | 18.3 | 19.2 |
| WideResNet28x10 (Poly) | 0.13 | 0.21 | 18.4 | 19.0 |

Below we show the train loss and test accuracy along the curves connecting two PreResNet164 networks trained with our method on CIFAR-100.

<p align="center"> <img src="https://user-images.githubusercontent.com/14368801/47621112-45da0d80-dac9-11e8-9e00-12f53fb4844a.png" width=800> </p>

CIFAR-10

In the table below we report the minimum and maximum train loss and test error (%) for the networks used as the endpoints and along the curves found by our method on CIFAR-10.

| DNN (Curve) | Min Train Loss | Max Train Loss | Min Test Error | Max Test Error |
| --- | --- | --- | --- | --- |
| VGG16 (Single) | 0.24 | 0.24 | 6.79 | 6.94 |
| VGG16 (Bezier) | 0.14 | 0.24 | 6.79 | 7.75 |
| VGG16 (Poly) | 0.16 | 0.27 | 6.79 | 8.08 |
| PreResNet164 (Single) | 0.18 | 0.18 | 4.75 | 4.76 |
| PreResNet164 (Bezier) | 0.09 | 0.18 | 4.45 | 4.97 |
| PreResNet164 (Poly) | 0.11 | 0.18 | 4.39 | 5.13 |
| WideResNet28x10 (Single) | 0.08 | 0.09 | 3.69 | 3.73 |
| WideResNet28x10 (Bezier) | 0.05 | 0.09 | 3.49 | 3.88 |
| WideResNet28x10 (Poly) | 0.05 | 0.10 | 3.53 | 4.29 |

Fast Geometric Ensembling (FGE)

In order to run FGE, you first need to pre-train the network that initializes the procedure. To do so, follow the instructions in the section on training the endpoints. Then you can run FGE with the following command:

python3 fge.py --dir=<DIR> \
                 --dataset=<DATASET> \
                 --data_path=<PATH> \
                 --transform=<TRANSFORM> \
                 --model=<MODEL> \
                 --epochs=<EPOCHS> \
                 --lr_init=<LR_INIT> \
                 --wd=<WD> \
                 --ckpt=<CKPT> \
                 --lr_1=<LR1> \
                 --lr_2=<LR2> \
                 --cycle=<CYCLE> \
                 [--use_test]

Parameters:

See the section on training the endpoints for the description of the other parameters.

In the figure below we show the learning rate (top), test error (middle) and distance from the initial value <CKPT> (bottom) as a function of iteration for FGE with PreResNet164 on CIFAR-100. Circles indicate when we save models for ensembling.

<p align="center"> <img src="https://user-images.githubusercontent.com/14368801/47262174-5f6acc00-d4af-11e8-954f-dfef255ad3ae.png" width=500> </p>
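The triangular schedule shown in the figure can be sketched as follows. This is an assumed shape matching the figure, not a verbatim copy of fge.py: within each cycle the rate moves linearly from lr_1 down to lr_2 and back, with lr_1 > lr_2, and models are collected near the low point of each cycle. The cycle length here is in abstract iteration units.

```python
def fge_lr(iteration, cycle, lr_1, lr_2):
    """Cyclical (triangular) learning rate: lr_1 -> lr_2 over the first half
    of each cycle, lr_2 -> lr_1 over the second half."""
    t = ((iteration % cycle) + 1) / cycle          # position within the cycle
    if t <= 0.5:
        return (1 - 2 * t) * lr_1 + 2 * t * lr_2   # descending half
    return (2 * t - 1) * lr_1 + (2 - 2 * t) * lr_2  # ascending half
```

For example, with cycle = 100 the rate reaches lr_2 at the cycle midpoint and returns to lr_1 by the cycle boundary.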

CIFAR-100

To reproduce the results from the paper run:

#VGG16
python3 train.py --dir=<DIR> --data_path=<PATH> --dataset=CIFAR100 --use_test --transform=VGG --model=VGG16 --epochs=200 --wd=5e-4 --lr=0.05 --save_freq=40
python3 fge.py --dir=<DIR> --ckpt=<DIR>/checkpoint-160.pt --data_path=<PATH> --dataset=CIFAR100 --use_test --transform=VGG --model=VGG16 --epochs=40 --wd=5e-4 --lr_1=1e-2 --lr_2=1e-2 --cycle=2

#PreResNet
python3 train.py --dir=<DIR>  --data_path=<PATH> --dataset=CIFAR100 --use_test --transform=ResNet --model=PreResNet164 --epochs=200 --wd=3e-4 --lr=0.1 --save_freq=40
python3 fge.py --dir=<DIR> --ckpt=<DIR>/checkpoint-160.pt --data_path=<PATH> --dataset=CIFAR100 --use_test --transform=ResNet --model=PreResNet164 --epochs=40 --wd=3e-4 --lr_1=0.05 --lr_2=0.01 --cycle=2

#WideResNet28x10
python3 train.py --dir=<DIR> --data_path=<PATH> --dataset=CIFAR100 --use_test --transform=ResNet --model=WideResNet28x10 --epochs=200 --wd=5e-4 --lr=0.1 --save_freq=40
python3 fge.py --dir=<DIR> --ckpt=<DIR>/checkpoint-160.pt --data_path=<PATH> --dataset=CIFAR100 --use_test --transform=ResNet --model=WideResNet28x10 --epochs=40 --wd=5e-4 --lr_1=0.05 --lr_2=0.01 --cycle=2

Test accuracy (%) of FGE and ensembling of independently trained networks (Ind) on CIFAR-100 for different training budgets. For each model the Budget is defined as the number of epochs required to train the model with the conventional SGD procedure.

| DNN (Method, Budget) | 1 Budget | 2 Budgets | 3 Budgets |
| --- | --- | --- | --- |
| VGG16 (Ind, 200) | 72.5 ± 0.1 | 74.8 | 75.6 |
| VGG16 (FGE, 200) | 74.6 ± 0.1 | 76.1 | 76.6 |
| PreResNet164 (Ind, 200) | 78.4 ± 0.1 | 80.5 | 81.6 |
| PreResNet164 (FGE, 200) | 80.3 ± 0.2 | 81.3 | 81.7 |
| WideResNet28x10 (Ind, 200) | 80.8 ± 0.3 | 82.4 | 83.0 |
| WideResNet28x10 (FGE, 200) | 82.3 ± 0.2 | 82.9 | 83.2 |
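At test time, FGE averages the class-probability vectors of the collected networks and predicts the argmax of the average. A minimal sketch with plain lists (the repo operates on batched model outputs instead):

```python
def ensemble_predict(prob_vectors):
    """Average per-class probability vectors from several models and return
    the index of the highest-probability class."""
    n_models = len(prob_vectors)
    n_classes = len(prob_vectors[0])
    avg = [sum(p[k] for p in prob_vectors) / n_models for k in range(n_classes)]
    return max(range(n_classes), key=avg.__getitem__)
```

Note that averaging probabilities can overturn an individual model's confident mistake: a model that narrowly prefers the wrong class is outvoted by one that strongly prefers the right one.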

References

The provided model implementations were adapted from the following repositories:

Other Implementations

Other Relevant Papers