
# convnet-benchmarks

Easy benchmarking of all public open-source implementations of convnets. A summary is provided in the section below.

**Machine:** 6-core Intel Core i7-5930K CPU @ 3.50GHz + NVIDIA Titan X + Ubuntu 14.04 x86_64

## Imagenet Winners Benchmarking

I picked some popular ImageNet models and clocked the time for a full forward + backward pass, averaged over 10 runs. Dropout and softmax layers are ignored.
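The measurement described above can be sketched as a simple timing loop. This is a minimal, framework-agnostic sketch, not the actual benchmark harness; the `forward` and `backward` callables below are hypothetical placeholders where a real benchmark would invoke the library under test.

```python
import time

def forward(batch):
    # Hypothetical stand-in for a framework's forward pass.
    return [x * 2 for x in batch]

def backward(batch):
    # Hypothetical stand-in for a framework's backward pass.
    return [x * 0.5 for x in batch]

def clock_full_pass(batch, n_runs=10):
    """Average wall-clock time (ms) of a full forward + backward pass over n_runs."""
    forward(batch)
    backward(batch)  # one untimed warm-up run
    start = time.perf_counter()
    for _ in range(n_runs):
        forward(batch)
        backward(batch)
    return (time.perf_counter() - start) * 1000.0 / n_runs

print(clock_full_pass(list(range(1000))))
```

Note that for GPU kernels a real harness must also synchronize the device before reading the clock, since kernel launches are asynchronous.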

### Notation

Input is described as {batch_size}x{num_channels}x{image_width}x{image_height}, where batch_size is the number of images in a minibatch, num_channels is the number of channels per image, and image_width and image_height are the width and height of each image.
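As a concrete example, the AlexNet input 128x3x224x224 used below corresponds to the following tensor shape, sketched here with NumPy (no particular benchmarked framework is implied):

```python
import numpy as np

# 128 images per minibatch, 3 channels (e.g. RGB), each image 224x224 pixels
batch = np.zeros((128, 3, 224, 224), dtype=np.float32)
print(batch.shape)  # (128, 3, 224, 224)
```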

One small note:

The cuDNN benchmarks are run through Torch bindings, but the same could be done through Caffe bindings or the bindings of any other library. This note is here to clarify that "Caffe (native)" and "Torch-7 (native)" refer to the default fallback convolution kernels in those frameworks. Some frameworks, such as TensorFlow and Chainer, are benchmarked with cuDNN without this being explicitly mentioned; one might therefore conclude that these frameworks as a whole are faster than, for example, Caffe, which is not necessarily the case.

#### AlexNet (One Weird Trick paper) - Input 128x3x224x224

| Library | Class | Time (ms) | forward (ms) | backward (ms) |
|:---|:---|---:|---:|---:|
| CuDNN[R4]-fp16 (Torch) | cudnn.SpatialConvolution | 71 | 25 | 46 |
| Nervana-neon-fp16 | ConvLayer | 78 | 25 | 52 |
| CuDNN[R4]-fp32 (Torch) | cudnn.SpatialConvolution | 81 | 27 | 53 |
| TensorFlow | conv2d | 81 | 26 | 55 |
| Nervana-neon-fp32 | ConvLayer | 87 | 28 | 58 |
| fbfft (Torch) | fbnn.SpatialConvolution | 104 | 31 | 72 |
| Chainer | Convolution2D | 177 | 40 | 136 |
| cudaconvnet2* | ConvLayer | 177 | 42 | 135 |
| CuDNN[R2] * | cudnn.SpatialConvolution | 231 | 70 | 161 |
| Caffe (native) | ConvolutionLayer | 324 | 121 | 203 |
| Torch-7 (native) | SpatialConvolutionMM | 342 | 132 | 210 |
| CL-nn (Torch) | SpatialConvolutionMM | 963 | 388 | 574 |
| Caffe-CLGreenTea | ConvolutionLayer | 1442 | 210 | 1232 |

#### Overfeat [fast] - Input 128x3x231x231

| Library | Class | Time (ms) | forward (ms) | backward (ms) |
|:---|:---|---:|---:|---:|
| Nervana-neon-fp16 | ConvLayer | 176 | 58 | 118 |
| Nervana-neon-fp32 | ConvLayer | 211 | 69 | 141 |
| CuDNN[R4]-fp16 (Torch) | cudnn.SpatialConvolution | 242 | 86 | 156 |
| CuDNN[R4]-fp32 (Torch) | cudnn.SpatialConvolution | 268 | 94 | 174 |
| TensorFlow | conv2d | 279 | 90 | 189 |
| fbfft (Torch) | SpatialConvolutionCuFFT | 342 | 114 | 227 |
| Chainer | Convolution2D | 620 | 135 | 484 |
| cudaconvnet2* | ConvLayer | 723 | 176 | 547 |
| CuDNN[R2] * | cudnn.SpatialConvolution | 810 | 234 | 576 |
| Caffe | ConvolutionLayer | 823 | 355 | 468 |
| Torch-7 (native) | SpatialConvolutionMM | 878 | 379 | 499 |
| CL-nn (Torch) | SpatialConvolutionMM | 963 | 388 | 574 |
| Caffe-CLGreenTea | ConvolutionLayer | 2857 | 616 | 2240 |

#### OxfordNet [Model-A] - Input 64x3x224x224

| Library | Class | Time (ms) | forward (ms) | backward (ms) |
|:---|:---|---:|---:|---:|
| Nervana-neon-fp16 | ConvLayer | 254 | 82 | 171 |
| Nervana-neon-fp32 | ConvLayer | 320 | 103 | 217 |
| CuDNN[R4]-fp16 (Torch) | cudnn.SpatialConvolution | 471 | 140 | 331 |
| CuDNN[R4]-fp32 (Torch) | cudnn.SpatialConvolution | 529 | 162 | 366 |
| TensorFlow | conv2d | 540 | 158 | 382 |
| Chainer | Convolution2D | 885 | 251 | 632 |
| fbfft (Torch) | SpatialConvolutionCuFFT | 1092 | 355 | 737 |
| cudaconvnet2* | ConvLayer | 1229 | 408 | 821 |
| CuDNN[R2] * | cudnn.SpatialConvolution | 1099 | 342 | 757 |
| Caffe | ConvolutionLayer | 1068 | 323 | 745 |
| Torch-7 (native) | SpatialConvolutionMM | 1105 | 350 | 755 |
| CL-nn (Torch) | SpatialConvolutionMM | 3437 | 875 | 2562 |
| Caffe-CLGreenTea | ConvolutionLayer | 5620 | 988 | 4632 |

#### GoogleNet V1 - Input 128x3x224x224

| Library | Class | Time (ms) | forward (ms) | backward (ms) |
|:---|:---|---:|---:|---:|
| Nervana-neon-fp16 | ConvLayer | 230 | 72 | 157 |
| Nervana-neon-fp32 | ConvLayer | 270 | 84 | 186 |
| TensorFlow | conv2d | 445 | 135 | 310 |
| CuDNN[R4]-fp16 (Torch) | cudnn.SpatialConvolution | 462 | 112 | 349 |
| CuDNN[R4]-fp32 (Torch) | cudnn.SpatialConvolution | 470 | 130 | 340 |
| Chainer | Convolution2D | 687 | 189 | 497 |
| Caffe | ConvolutionLayer | 1935 | 786 | 1148 |
| CL-nn (Torch) | SpatialConvolutionMM | 7016 | 3027 | 3988 |
| Caffe-CLGreenTea | ConvolutionLayer | 9462 | 746 | 8716 |

## Layer-wise Benchmarking (Last Updated April 2015)

### Spatial Convolution layer (3D input, 3D output, densely connected)

#### forward + backprop (wrt input and weights)

| Original Library | Class/Function Benchmarked | Time (ms) | forward (ms) | backward (ms) |
|:---|:---|---:|---:|---:|
| fbfft | SpatialConvolutionCuFFT | 256 | 101 | 155 |
| cuda-convnet2 * | ConvLayer | 977 | 201 | 776 |
| cuda-convnet** | pylearn2.cuda_convnet | 1077 | 312 | 765 |
| CuDNN R2 * | cudnn.SpatialConvolution | 1019 | 269 | 750 |
| Theano | CorrMM | 1225 | 407 | 818 |
| Caffe | ConvolutionLayer | 1231 | 396 | 835 |
| Torch-7 | SpatialConvolutionMM | 1265 | 418 | 877 |
| DeepCL | ConvolutionLayer | 6280 | 2648 | 3632 |
| cherry-picking**** | best per layer | 235 | 79 | 155 |

This table is NOT updated for the Titan X. The numbers below were measured on a Titan Black and are kept only for informational and legacy purposes.

| Original Library | Class/Function Benchmarked | Time (ms) | forward (ms) | backward (ms) |
|:---|:---|---:|---:|---:|
| Theano (experimental)*** | conv2d_fft | 1178 | 304 | 874 |
| Torch-7 | nn.SpatialConvolutionBHWD | 1892 | 581 | 1311 |
| ccv | ccv_convnet_layer | 809+bw | 809 | |
| Theano (legacy) | conv2d | 70774 | 3833 | 66941 |
### Breakdown

#### forward

Columns L1, L2, L3, L4, L5, Total are times in milliseconds

| Original Library | Class/Function Benchmarked | L1 | L2 | L3 | L4 | L5 | Total |
|:---|:---|---:|---:|---:|---:|---:|---:|
| fbfft | SpatialConvolutionCuFFT | 57 | 2 | 7 | 6 | 29 | 101 |
| cuda-convnet2 * | ConvLayer | 36 | 113 | 40 | 4 | 8 | 201 |
| cuda-convnet** | pylearn2.cuda_convnet | 38 | 183 | 68 | 7 | 16 | 312 |
| CuDNN R2 | cudnn.SpatialConvolution | 56 | 143 | 53 | 6 | 11 | 269 |
| Theano | CorrMM | 91 | 143 | 121 | 24 | 28 | 407 |
| Caffe | ConvolutionLayer&lt;Dtype&gt; | 93 | 136 | 116 | 24 | 27 | 396 |
| Torch-7 | nn.SpatialConvolutionMM | 94 | 149 | 123 | 24 | 28 | 418 |
| DeepCL | ConvolutionLayer | 738 | 1241 | 518 | 47 | 104 | 2648 |
| cherry-picking**** | best per layer | 36 | 2 | 7 | 6 | 28 | 79 |
#### backward (gradInput + gradWeight)

Columns L1, L2, L3, L4, L5, Total are times in milliseconds

| Original Library | Class/Function Benchmarked | L1 | L2 | L3 | L4 | L5 | Total |
|:---|:---|---:|---:|---:|---:|---:|---:|
| fbfft | SpatialConvolutionCuFFT | 76 | 45 | 12 | 4 | 18 | 155 |
| cuda-convnet2 * | ConvLayer | 103 | 467 | 162 | 15 | 29 | 776 |
| cuda-convnet** | pylearn2.cuda_convnet | 136 | 433 | 147 | 15 | 34 | 765 |
| CuDNN R2 | cudnn.SpatialConvolution | 139 | 401 | 159 | 19 | 32 | 750 |
| Theano | CorrMM | 179 | 405 | 174 | 29 | 31 | 818 |
| Caffe | ConvolutionLayer&lt;Dtype&gt; | 200 | 405 | 172 | 28 | 30 | 835 |
| Torch-7 | nn.SpatialConvolutionMM | 206 | 432 | 178 | 29 | 32 | 877 |
| DeepCL | ConvolutionLayer | 484 | 2144 | 747 | 59 | 198 | 3632 |
| cherry-picking**** | best per layer | 76 | 45 | 12 | 4 | 18 | 155 |