# M-FAC
This repository contains efficient reference implementations of the static and dynamic M-FAC algorithms, introduced in the paper "M-FAC: Efficient Matrix-Free Approximations of Second-Order Information" to be published at NeurIPS 2021, plus some sample code demonstrating their use in optimization and pruning.
More concretely, it contains the following:
- An efficient full implementation of the dynamic algorithm, including custom CUDA kernels: `optim.py`, `hinv_cuda_kernel.cu`, `hinv_cuda.cpp`, `setup_cuda.py`
- A PyTorch-compatible implementation of the M-FAC optimizer: `optim.py`
- A script for running simple optimization experiments: `main_optim.py`
- An efficient implementation of the static algorithm in blocked form with simultaneous handling of multiple blocks: `prun.py`
- An implementation of the full (non-blocked) static algorithm with efficient paging: `prun.py`
- An implementation of a pruner that utilizes the static algorithm: `main_prun.py`
- A script for running simple gradual and one-shot (also with recomputation) pruning experiments: `main_prun.py`
- Some standard library code for models, data loading, etc.
## Optimization
First, install the CUDA kernels for more efficient coefficient computation: `python setup_cuda.py install`. The code also runs without them, but substantially slower.
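For reference, such an installation script typically follows PyTorch's standard extension-building pattern. The following is a minimal sketch of what `setup_cuda.py` plausibly contains, using the source files listed above; the actual file in the repository may differ in names and options:

```
# Minimal sketch of a PyTorch CUDA extension build script; the actual
# setup_cuda.py in this repository may differ.
from setuptools import setup
from torch.utils.cpp_extension import BuildExtension, CUDAExtension

setup(
    name='hinv_cuda',
    ext_modules=[
        # Compiles the C++ binding and the CUDA kernel into one module.
        CUDAExtension('hinv_cuda', ['hinv_cuda.cpp', 'hinv_cuda_kernel.cu']),
    ],
    cmdclass={'build_ext': BuildExtension},
)
```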
The file `main_optim.py` provides a simple command-line interface for running dense M-FAC optimization experiments. A sample call could look like this:
```
CUDA_VISIBLE_DEVICES=0 python3 main_optim.py \
    --model resnet20 \
    --dataset DATASET \
    --optim mfac \
    --ngrads 512 \
    --weightdecay .003 \
    --batchsize 128 \
    --save rn20-mfac.pth \
    > rn20-mfac.txt
```
`--model` specifies the model to optimize (see `--help` for a full list of model names), `--dataset` the path to the dataset (for CIFAR models, the data is automatically downloaded if it does not yet exist), `--optim` the optimizer to use, `--ngrads` the size of the sliding window for the dynamic algorithm, `--weightdecay` the weight decay, `--batchsize` the batch size and `--save` the name of the file where the model is stored after each epoch.
Finally, it is worth noting that the optimizer implementation `optim.MFAC` also supports sparse optimization.
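As a rough illustration, a training loop with `optim.MFAC` looks like one with any other PyTorch optimizer. Note that the constructor arguments below (in particular `ngrads` and `weight_decay`) are assumptions inferred from the command-line flags above; consult `optim.py` for the exact signature:

```
import torch
import torch.nn.functional as F
from optim import MFAC  # the optimizer implementation in this repository

model = torch.nn.Linear(10, 2)
# `ngrads` and `weight_decay` are assumed keyword names, mirroring the
# --ngrads and --weightdecay CLI flags; check optim.py for the real signature.
optimizer = MFAC(model.parameters(), lr=1e-3, ngrads=512, weight_decay=3e-3)

for _ in range(10):  # dummy data, standing in for a real dataloader
    x, y = torch.randn(32, 10), torch.randint(0, 2, (32,))
    optimizer.zero_grad()
    F.cross_entropy(model(x), y).backward()
    optimizer.step()
```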
## Pruning
The file `main_prun.py` provides a simple interface for executing various gradual and one-shot pruning experiments. Only ResNet20/CIFAR with a corresponding pretrained model is currently included, but other models should be straightforward to add. An example call:
```
CUDA_VISIBLE_DEVICES=0 python3 -u main_prun.py \
    --model resnet20 \
    --checkpoint checkpoints/resnet20_cifar10.pth.tar \
    --data datasets \
    --nepochs 10 \
    --optim sgd \
    --lr .005 \
    --momentum .9 \
    --batchsize 128 \
    --drop_at 7 9 \
    --pruner mfac \
    --blocksize 128 \
    --nrecomps 16 \
    --ngrads_schedule 64 \
    --sparsities .5 .75 .875 \
    --prun_every 2 \
    --prun_lrs .005 .0005 \
    --prefix experiments/rn20-test/model
```
Most arguments are straightforward; see `--help` for a full list of arguments and their descriptions. For gradual pruning, the key arguments are `--sparsities`, `--prun_every` and `--prun_lrs`. The first specifies the individual pruning steps in terms of overall sparsity relative to all pruned parameters (`--adjust_sparsities` automatically turns these into overall sparsities with respect to all parameters), starting with initial pruning before epoch 0. The second defines how many finetuning epochs there are between pruning steps, while the third gives the learning rates to use for those (as discussed in the paper, we find that dropping the learning rate one epoch before the next pruning step can be helpful). After the last pruning step is complete, additional finetuning begins with base learning rate `--lr`, which is dropped by `--drop_by` (default 0.1) at epochs `--drop_at` (overall, i.e. also counting the gradual pruning ones). For one-shot experiments, simply set `--nepochs` to 0.
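To make the schedule concrete, the following plain-Python sketch (not repository code) spells out the timeline implied by the example call above; the exact epoch conventions are our reading of the description, not verified against `main_prun.py`:

```
# Timeline implied by: --nepochs 10 --sparsities .5 .75 .875 --prun_every 2
sparsities = [.5, .75, .875]
prun_every, nepochs = 2, 10

for i, s in enumerate(sparsities):
    # Pruning step i happens before epoch i * prun_every (the first one
    # before epoch 0), followed by prun_every finetuning epochs at the
    # corresponding --prun_lrs learning rate.
    print(f'before epoch {i * prun_every}: prune to {s:.1%} overall sparsity')
last = (len(sparsities) - 1) * prun_every
print(f'epochs {last}-{nepochs - 1}: finetune at --lr, dropped at --drop_at')
```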
There are also additional M-FAC parameters: `--blocksize` for blocked estimation (with the advanced optimization parameter `--perbatch` to specify how many blocks are to be handled simultaneously) and `--pages` to specify how many pages to use for a full estimation where the gradients do not fully fit into GPU memory (used when `--blocksize` is -1).
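As a back-of-the-envelope illustration of why blocking and paging matter: the static algorithm keeps a window of gradients in memory, and the numbers below (for a hypothetical 25M-parameter model) are only meant to show the orders of magnitude involved, not the exact memory behavior of the implementation:

```
# Rough memory estimate for the stored gradient window (fp32, 4 bytes/entry).
m = 1024          # number of gradients kept (--ngrads)
d = 25_000_000    # model parameters (hypothetical)
blocksize = 128   # --blocksize

full_gb = m * d * 4 / 1e9           # full estimation: whole gradients at once
block_mb = m * blocksize * 4 / 1e6  # blocked: one block's slice at a time
print(f'full window: ~{full_gb:.0f} GB, per-block slice: ~{block_mb:.2f} MB')
```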
## Models
Checkpoints of the following sparse MobileNetV1 and ResNet50 models from the practical pruning experiments can be found at this link. They are compatible with the model definitions in the STR repository.
| Model @ Sparsity | MBv1 @ 75% | MBv1 @ 89% | RN50 @ 95% | RN50 @ 98% |
|---|---|---|---|---|
| Accuracy | 70.9 | 67.2 | 72.6 | 67.6 |
Furthermore, our best finetuned BERT-tiny and BERT-mini models for SQuADv2 and GLUE tasks are uploaded to HuggingFace Hub.
## BibTeX
```
@article{frantar2021m,
  title={M-FAC: Efficient Matrix-Free Approximations of Second-Order Information},
  author={Frantar, Elias and Kurtic, Eldar and Alistarh, Dan},
  journal={Advances in Neural Information Processing Systems},
  volume={35},
  year={2021}
}
```