

Bag of Tricks for Image Classification with Convolutional Neural Networks

This repo was inspired by the paper Bag of Tricks for Image Classification with Convolutional Neural Networks.

I will test as many popular training tricks as I can for improving image classification accuracy. Feel free to leave a comment about the tricks you want me to test (please include the referenced paper along with the trick).


The experiments run on 4 Tesla P40 GPUs.


I use the CUB_200_2011 dataset instead of ImageNet, just for simplicity. It is a fine-grained image classification dataset containing 200 bird categories, 5K+ training images, and 5K+ test images. The state-of-the-art accuracy on VGG16 is around 73% (please correct me if I am wrong). You can easily swap in another dataset you like: Stanford Dogs, Stanford Cars, or even ImageNet.


I use a VGG16 network to test the tricks, also for simplicity, since VGG16 is easy to implement. I'm considering switching to AlexNet to see how powerful these tricks are.


Tricks I've tested, some of them from the paper Bag of Tricks for Image Classification with Convolutional Neural Networks:

| trick | referenced paper |
| --- | --- |
| xavier init | Understanding the difficulty of training deep feedforward neural networks |
| warmup training | Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour |
| no bias decay | Highly Scalable Deep Learning Training System with Mixed-Precision: Training ImageNet in Four Minutes |
| label smoothing | Rethinking the inception architecture for computer vision |
| random erasing | Random Erasing Data Augmentation |
| cutout | Improved Regularization of Convolutional Neural Networks with Cutout |
| linear scaling learning rate | Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour |
| cosine learning rate decay | SGDR: Stochastic Gradient Descent with Warm Restarts |
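
To make the label smoothing trick concrete, here is a minimal sketch in plain Python. It is not the training code from this repo: the function names, the smoothing value `eps=0.1`, and the choice to spread `eps` evenly over the other classes (one common formulation) are all assumptions for illustration.

```python
import math


def smooth_labels(num_classes, target, eps=0.1):
    """Return a smoothed target distribution (eps=0.1 is an assumed value)."""
    # The true class keeps 1 - eps; the rest share eps equally.
    off_value = eps / (num_classes - 1)
    return [1.0 - eps if i == target else off_value for i in range(num_classes)]


def cross_entropy(logits, target_dist):
    """Cross entropy between softmax(logits) and a target distribution."""
    # Log-sum-exp with the max subtracted for numerical stability.
    max_logit = max(logits)
    log_norm = max_logit + math.log(sum(math.exp(z - max_logit) for z in logits))
    return -sum(q * (z - log_norm) for q, z in zip(target_dist, logits))


# 200 classes, as in CUB_200_2011; class index 3 is an arbitrary example.
q = smooth_labels(num_classes=200, target=3, eps=0.1)
loss = cross_entropy([0.0] * 200, q)
```

The smoothed distribution still sums to 1, so it can be dropped into a cross-entropy loss in place of the one-hot target.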

and more to come......
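
Of the augmentation tricks listed above, cutout is the easiest to sketch: zero out one square patch at a random position, clipping the patch at the image border. The version below works on a plain 2-D list instead of a tensor, and the function name, the patch size, and the fill value of 0 are assumptions, not this repo's implementation. Random erasing is related but also randomizes the patch shape and fill, which this sketch does not do.

```python
import random


def cutout(image, size, rng=random):
    """Return a copy of a 2-D image (list of rows) with one square patch zeroed."""
    height, width = len(image), len(image[0])
    # Pick the patch center anywhere in the image; the patch may be
    # clipped at the border, so fewer than size * size pixels can be masked.
    cy, cx = rng.randrange(height), rng.randrange(width)
    top, bottom = max(0, cy - size // 2), min(height, cy + size // 2)
    left, right = max(0, cx - size // 2), min(width, cx + size // 2)
    out = [row[:] for row in image]  # leave the input untouched
    for y in range(top, bottom):
        for x in range(left, right):
            out[y][x] = 0.0
    return out


# Hypothetical 8x8 single-channel image and a 4x4 patch.
image = [[1.0] * 8 for _ in range(8)]
masked = cutout(image, size=4, rng=random.Random(0))
```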


baseline (training from scratch, no ImageNet pretrained weights are used):

vgg16 64.60% on CUB_200_2011 dataset, lr=0.01, batchsize=64

effects of stacking tricks

| tricks | accuracy |
| --- | --- |
| +xavier init and warmup training | 66.07% |
| +no bias decay | 70.14% |
| +label smoothing | 71.20% |
| +random erasing | does not work, drops about 4 points |
| +linear scaling learning rate (batchsize 256, lr 0.04) | 71.21% |
| +cutout | does not work, drops about 1 point |
| +cosine learning rate decay | does not work, drops about 1 point |
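
The learning rate tricks in the results above fit in a few lines. The sketch below is plain Python, not this repo's scheduler: the function names, the 5 warmup steps, and the 100 total steps are hypothetical values chosen for illustration. Only the baseline (lr 0.01 at batch size 64) and the scaled setting (batch size 256) come from the results, and the linear scaling rule reproduces the reported lr of 0.04.

```python
import math


def scaled_lr(base_lr, base_batch_size, batch_size):
    """Linear scaling rule: grow the learning rate in proportion to batch size."""
    return base_lr * batch_size / base_batch_size


def lr_at_step(step, total_steps, peak_lr, warmup_steps=0, cosine=False):
    """Learning rate at a 0-indexed step, with optional warmup and cosine decay."""
    if step < warmup_steps:
        # Linear warmup: ramp up to peak_lr over the first warmup_steps steps.
        return peak_lr * (step + 1) / warmup_steps
    if not cosine:
        return peak_lr
    # Cosine decay from peak_lr toward 0 over the remaining steps (no restarts).
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * progress))


# Baseline lr=0.01 at batch size 64, scaled to batch size 256.
peak = scaled_lr(base_lr=0.01, base_batch_size=64, batch_size=256)

# Hypothetical schedule: 5 warmup steps out of 100 total.
schedule = [lr_at_step(s, 100, peak, warmup_steps=5, cosine=True) for s in range(100)]
```

In a real training loop the value for the current step would be written into the optimizer before each update.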