Mixed Precision Training in PyTorch


Training in FP16, that is, in half precision, results in slightly faster training on NVIDIA cards that support half-precision ops. The memory requirements for the model weights are also almost halved, since we use a 16-bit format to store the weights instead of 32 bits.

Training in half precision has its own caveats, though. The problems encountered in half-precision training are:

- imprecise weight updates
- gradient underflow
- reduction overflow

Below is a discussion on how to deal with these problems.

FP16 Basics

The IEEE 754 floating-point standard states that, given a floating-point number X with
2^E <= abs(X) < 2^(E+1), the distance from X to the next largest representable floating point number (epsilon) is:

epsilon = 2^(E-10) for FP16 (10 fraction bits)
epsilon = 2^(E-23) for FP32 (23 fraction bits)

The above equations allow us to compute the spacing around 1 (E = 0): about 0.000977 (2^-10) in FP16 versus about 1.19e-7 (2^-23) in FP32.
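As a quick sanity check, PyTorch exposes these machine-epsilon values through torch.finfo (a small illustrative snippet, not part of the repository code):

```python
import torch

# Machine epsilon: the spacing between 1.0 and the next representable value
print(torch.finfo(torch.float16).eps)  # 0.0009765625            == 2**-10
print(torch.finfo(torch.float32).eps)  # 1.1920928955078125e-07  == 2**-23
```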

Imprecise Weight Update

Thus, while training our network we'll need that added precision, since our weights go through small updates. For example, 1 + 0.0001 results in 1.0001 in FP32, but just 1 in FP16, because 0.0001 is smaller than the FP16 spacing around 1.
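You can reproduce this in PyTorch (an illustrative snippet, not part of the repository code):

```python
import torch

# The update 0.0001 is below the FP16 spacing around 1 (2**-10), so it is lost.
print(torch.tensor(1.0, dtype=torch.float16) + 0.0001)  # tensor(1., dtype=torch.float16)
print(torch.tensor(1.0, dtype=torch.float32) + 0.0001)  # tensor(1.0001)
```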

What that means is that we risk underflow (attempting to represent numbers so small they clamp to zero) and overflow (numbers so large they become NaN, not a number). With underflow, our network never learns anything, and with overflow, it learns garbage.
To overcome this we keep an "FP32 master copy": a copy of our FP16 model weights in FP32. We update the weights on these master params and then copy them back into the FP16 model. We also copy the gradients into the master copy as they are calculated in the model.
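Here is a minimal sketch of the master-copy idea, assuming a CUDA device. It is illustrative only; the repository's actual implementation is in train.py and the names below are made up:

```python
import torch

model = torch.nn.Linear(10, 10).cuda().half()  # stand-in FP16 model
# FP32 master copy of the FP16 weights
master_params = [p.detach().clone().float() for p in model.parameters()]
for mp in master_params:
    mp.requires_grad_(True)
optimizer = torch.optim.SGD(master_params, lr=0.1)  # the optimizer updates the FP32 copy

def master_copy_step():
    # 1. Copy the FP16 gradients produced by loss.backward() into the FP32 master copy.
    for master, p in zip(master_params, model.parameters()):
        if p.grad is not None:
            master.grad = p.grad.detach().float()
    # 2. Apply the weight update in FP32, where small updates are still representable.
    optimizer.step()
    optimizer.zero_grad()
    # 3. Copy the updated FP32 weights back into the FP16 model.
    with torch.no_grad():
        for master, p in zip(master_params, model.parameters()):
            p.copy_(master)
```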

Gradients Underflow

Gradients are sometimes not representable in FP16: very small gradient values get flushed to zero. This is the gradient underflow problem. A way to deal with it is to scale the gradients so that they fall in a range representable by half-precision floats: multiply the loss by a large power of two such as 2^7, which shifts the gradients computed during loss.backward() into the FP16-representable range. Then, when we copy these gradients to the FP32 master copy, we scale them back down by dividing by the same scaling factor, so the weight update itself is unchanged.
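A minimal sketch of static loss scaling, reusing model and master_params from the previous sketch (illustrative names; the repository's implementation is in train.py):

```python
scale_factor = 2 ** 7  # static loss scale

def backward_with_loss_scaling(loss):
    # Scale the loss so the FP16 gradients computed by backward() do not underflow.
    (loss * scale_factor).backward()
    # Copy the gradients to the FP32 master copy and undo the scaling there,
    # so the actual weight update is unaffected by the scale factor.
    for master, p in zip(master_params, model.parameters()):
        if p.grad is not None:
            master.grad = p.grad.detach().float() / scale_factor
```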

Reduction Overflow

Another caveat of half precision is that large reductions may overflow or lose precision. For example, consider two tensors, a and b, whose sums in single precision are 16376 and 16380 respectively, as expected. If we do the same ops in half precision, a.sum() still gives 16376, but b.sum() gives 16384, because 16380 is not representable in FP16 (above 8192, FP16 values are spaced 8 apart). To overcome this problem we do reduction ops such as BatchNorm and the loss calculation in FP32.
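The original a and b are omitted here, but the rounding behaviour is easy to reproduce; the snippet below is illustrative and not part of the repository code:

```python
import torch

# 16380 is not representable in FP16: above 8192 the spacing between FP16
# values is 8, so 16380 rounds to 16384.
print(torch.tensor(16380.0, dtype=torch.float16))  # tensor(16384., dtype=torch.float16)
print(torch.tensor(16376.0, dtype=torch.float16))  # tensor(16376., dtype=torch.float16)

# A long running sum in FP16 can also stall: once the total reaches 2048,
# adding 1.0 no longer changes it (the spacing there is 2).
total = torch.tensor(0.0, dtype=torch.float16)
for _ in range(10_000):
    total = total + 1.0
print(total)                                       # tensor(2048., dtype=torch.float16)
```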

Keeping all these problems in mind allows us to successfully train with FP16 weights. An implementation of the above ideas can be found in the train.py file.

Usage Instructions

python main.py [-h] [--lr LR] [--steps STEPS] [--gpu] [--fp16] [--loss_scaling] [--model MODEL]

PyTorch (FP16) CIFAR10 Training

optional arguments:
  -h, --help            Show this help message and exit
  --lr LR               Learning Rate
  --steps STEPS, -n STEPS
                        No of Steps
  --gpu, -p             Train on GPU
  --fp16                Train with FP16 weights
  --loss_scaling, -s    Scale FP16 losses
  --model MODEL, -m MODEL
                        Name of Network

To run in FP32 mode, use:
python main.py -n 200 -p --model resnet50

To train with FP16 weights, use:
python main.py -n 200 -p --fp16 -s --model resnet50
The -s flag enables loss scaling.

Results

Training on a single P100 Pascal GPU, I was able to obtain the following results while training ResNet50 with a batch size of 128 over 200 epochs.

|            | FP32   | Mixed Precision |
|------------|--------|-----------------|
| Time/Epoch | 1m32s  | 1m15s           |
| Storage    | 90 MB  | 46 MB           |
| Accuracy   | 94.50% | 94.43%          |

Training on 4x K80 Tesla GPUs, with ResNet50 with a batch size of 512 over 200 epochs.

|            | FP32    | Mixed Precision |
|------------|---------|-----------------|
| Time/Epoch | 1m24s   | 1m17s           |
| Storage    | 90 MB   | 46 MB           |
| Accuracy   | 94.634% | 94.922%         |

Training on 4x P100 Tesla GPUs, with ResNet50 with a batch size of 512 over 200 epochs.

|            | FP32     | Mixed Precision |
|------------|----------|-----------------|
| Time/Epoch | 26s224ms | 23s359ms        |
| Storage    | 90 MB    | 46 MB           |
| Accuracy   | 94.51%   | 94.78%          |

Training on a single V100 Volta GPU, with ResNet50 with a batch size of 128 over 200 epochs.

|            | FP32     | Mixed Precision |
|------------|----------|-----------------|
| Time/Epoch | 47s112ms | 25s601ms        |
| Storage    | 90 MB    | 46 MB           |
| Accuracy   | 94.87%   | 94.65%          |

Training on 4x V100 Volta GPUs, with ResNet50 with a batch size of 512 over 200 epochs.

|            | FP32     | Mixed Precision |
|------------|----------|-----------------|
| Time/Epoch | 17s841ms | 12s833ms        |
| Storage    | 90 MB    | 46 MB           |
| Accuracy   | 94.38%   | 94.60%          |

Speedup of the row setup with respect to the column setup is summarized in the following table.

|             | 1xP100:FP32 | 1xP100:FP16 | 4xP100:FP32 | 4xP100:FP16 | 1xV100:FP32 | 1xV100:FP16 | 4xV100:FP32 |
|-------------|-------------|-------------|-------------|-------------|-------------|-------------|-------------|
| 1xP100:FP16 | 22.67 %     | 0.0 %       | 0.0 %       | 0.0 %       | 0.0 %       | 0.0 %       | 0.0 %       |
| 4xP100:FP32 | 250.82 %    | 186.0 %     | 0.0 %       | 0.0 %       | 79.65 %     | 0.0 %       | 0.0 %       |
| 4xP100:FP16 | 293.85 %    | 221.08 %    | 12.27 %     | 0.0 %       | 101.69 %    | 9.6 %       | 0.0 %       |
| 1xV100:FP32 | 95.28 %     | 59.2 %      | 0.0 %       | 0.0 %       | 0.0 %       | 0.0 %       | 0.0 %       |
| 1xV100:FP16 | 259.36 %    | 192.96 %    | 2.43 %      | 0.0 %       | 84.02 %     | 0.0 %       | 0.0 %       |
| 4xV100:FP32 | 415.67 %    | 320.38 %    | 46.99 %     | 30.93 %     | 164.07 %    | 43.5 %      | 0.0 %       |
| 4xV100:FP16 | 616.9 %     | 484.43 %    | 104.35 %    | 82.02 %     | 267.12 %    | 99.49 %     | 39.02 %     |

TODO

Further Explorations:

Convenience:

NVIDIA provides the apex library, which handles all the caveats of training in mixed precision. It also provides APIs for multi-process distributed training with NCCL, and a SyncBatchNorm that reduces stats across processes during multi-process distributed data-parallel training.
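As a rough sketch of what apex's automatic mixed precision (amp) API looks like in practice (placeholder model, optimizer, and data; consult the apex documentation for authoritative usage):

```python
import torch
from apex import amp  # requires NVIDIA apex to be installed

model = torch.nn.Linear(10, 10).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

# opt_level "O1" casts common ops to FP16 and handles loss scaling internally.
model, optimizer = amp.initialize(model, optimizer, opt_level="O1")

inputs = torch.randn(8, 10, device="cuda")
loss = model(inputs).sum()

# Scaled backward pass: amp applies loss scaling under the hood.
with amp.scale_loss(loss, optimizer) as scaled_loss:
    scaled_loss.backward()
optimizer.step()
```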


Thanks:

The project heavily borrows from @kuangliu's project pytorch-cifar. The models have been borrowed directly from that repository with minimal changes, so thanks to @kuangliu for maintaining such an awesome project.