Pytorch element-wise operation optimization benchmark

1. Abstract

This project provides a benchmark for evaluating element-wise operation performance on CPUs.

Tested CPU:

| CPU Model | Sockets | Cores/Socket | Frequency |
| --- | --- | --- | --- |
| Intel(R) Xeon(R) CPU E5-2699 v4 | 2 | 22 | 2.20GHz |
| Intel(R) Xeon(R) Platinum 8180 CPU | 2 | 28 | 2.50GHz |
| Intel(R) Core(TM) i7-5960X CPU | 1 | 8 | 3.00GHz |

Tested operations:

copy, add, div, sin, exp, sum, prod

Conclusions:

Annotation:
OpenMP threshold -- if a tensor's size is larger than this value, the operation runs in parallel; otherwise it runs in serial.

This benchmark also gives a rough estimation of the optimal OpenMP threshold for the copy, add, div, exp, sin, sum, and prod operations on different types of CPU.

For contiguous tensor operations:

| Operation | Xeon(R) Platinum 8180 CPU | Xeon(R) CPU E5-2699 v4 | i7-5960X CPU |
| --- | --- | --- | --- |
| copy | 80k | 20k | 8k |
| add | 80k | 20k | 8k |
| div | 50k | 10k | 2k |
| exp | 1k | 1k | 1k |
| sin | 1k | 1k | 1k |
| sum | 1k | 1k | 1k |
| prod | 1k | 1k | 1k |

For discontiguous tensor operations:

| Operation | Xeon(R) Platinum 8180 CPU | Xeon(R) CPU E5-2699 v4 | i7-5960X CPU |
| --- | --- | --- | --- |
| copy | 20k | 8k | 2k |
| add | 20k | 8k | 2k |
| div | 10k | 8k | 1k |
| exp | 1k | 1k | 1k |
| sin | 2k | 2k | 1k |
| sum | 1k | 1k | 1k |
| prod | 1k | 1k | 1k |

2. Major work

3. Installation and test

3.1 Installation

Official PyTorch

Please refer to the official PyTorch installation instructions.

Intel PyTorch

Download the Intel PyTorch source code:

git clone --recursive -b dev-omp2 https://github.com/intel/pytorch.git

Before installing, you should set the CMAKE_PREFIX_PATH.

export CMAKE_PREFIX_PATH="$(dirname $(which conda))/../" # [anaconda root directory]

Install Intel PyTorch:

python setup.py install

3.2 Test

python benchmark.py <CONTIGUITY> <OPERATION> [OUTPUT FILENAME] 

Positional arguments:
CONTIGUITY -- operands' contiguity: contiguous/discontiguous
OPERATION -- operation: copy/add/div/sin/exp/sum/prod

Optional arguments:
OUTPUT FILENAME -- output filename; defaults to output.log
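
For example, to benchmark the add operation on contiguous tensors (assuming the literal argument values listed above, with results written to the default output.log):

python benchmark.py contiguous add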

4. The benchmark result

4.1 Contiguous Tensor Operation OpenMP Threshold Tuning

Add and exp operations on contiguous tensors with sizes ranging from 1K to 100K are listed here as test cases. We compiled two versions of official PyTorch with different OpenMP thresholds. In one version the threshold is set to 100K so that all test cases run in serial; in the other it is set to 800 so that all test cases run in parallel.
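
As a rough illustration (not the benchmark script itself), a similar serial-versus-parallel comparison can be approximated on a stock PyTorch build by toggling torch.set_num_threads instead of recompiling with different thresholds; the tensor sizes and repetition count below are arbitrary choices:

```python
import time
import torch

def median_us(fn, reps=200):
    """Median wall-clock time of fn() over `reps` runs, in microseconds."""
    samples = []
    for _ in range(reps):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1e6)
    return sorted(samples)[len(samples) // 2]

max_threads = torch.get_num_threads()

for size in (1_000, 10_000, 80_000, 100_000):
    x, y = torch.randn(size), torch.randn(size)

    torch.set_num_threads(1)            # force single-threaded (serial) execution
    serial = median_us(lambda: x + y)

    torch.set_num_threads(max_threads)  # allow the full thread pool (parallel)
    parallel = median_us(lambda: x + y)

    print(f"{size:>7}: serial {serial:7.2f} us, "
          f"parallel {parallel:7.2f} us, speedup {serial / parallel:.2f}x")
```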

Platform: Platinum 8180
Operation: add
Tensor Contiguity: contiguous
Unit: microsecond

Time cost results are shown below:

| Tensor Size | In series | In parallel | SpeedUp |
| --- | --- | --- | --- |
| 1k | 1.04 | 5.15 | 0.20X |
| 2k | 1.23 | 5.47 | 0.22X |
| 3k | 1.33 | 5.34 | 0.24X |
| 4k | 1.47 | 5.41 | 0.27X |
| 5k | 1.48 | 5.40 | 0.27X |
| 8k | 1.81 | 5.55 | 0.32X |
| 10k | 1.98 | 5.66 | 0.35X |
| 20k | 2.74 | 6.74 | 0.40X |
| 50k | 5.12 | 6.59 | 0.77X |
| 80k | 14.79 | 6.59 | 2.24X |
| 100k | 21.97 | 6.70 | 3.27X |

Conclusion: Setting the threshold to 80K is a good choice for the add operation on contiguous tensors.

Platform: Platinum 8180
Operation: exp
Tensor Contiguity: contiguous
Unit: microsecond

Time cost results are shown below:

| Tensor Size | In series | In parallel | SpeedUp |
| --- | --- | --- | --- |
| 1k | 9.48 | 5.66 | 1.67X |
| 2k | 17.00 | 6.35 | 2.67X |
| 3k | 24.82 | 6.03 | 4.11X |
| 4k | 32.52 | 6.28 | 5.17X |
| 5k | 40.33 | 6.27 | 6.42X |
| 8k | 63.58 | 7.04 | 9.02X |
| 10k | 79.13 | 7.61 | 10.38X |
| 20k | 156.78 | 9.11 | 17.20X |
| 50k | 387.85 | 15.07 | 25.73X |
| 80k | 623.34 | 20.23 | 30.80X |
| 100k | 779.95 | 23.57 | 33.08X |

Conclusion: Setting the threshold to 1K is a good choice for the exponential (exp) operation on contiguous tensors.

From the above results, it is easy to see that lightweight operations such as add only benefit from OpenMP at large tensor sizes, while computationally heavier operations such as exp benefit from parallelization even at small sizes.

We do not list all of the detailed data for the div, sin, sum, and prod operations, but the tables in Section 1 provide a rough estimation of the optimal OpenMP threshold for each operation.

4.2 Discontiguous tensor operation parallelization

Add and exp operation performance for discontiguous tensors with sizes ranging from 1k to 180k is listed here. Official PyTorch does not parallelize operations on discontiguous tensors with OpenMP, but the Intel version does. To show that OpenMP also benefits discontiguous tensor operations, and to find an optimal OpenMP threshold, we compiled two versions of PyTorch: one is official PyTorch, and the other is the Intel version with its OpenMP threshold set to 800 so that all test cases run in parallel.
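
For reference, a discontiguous tensor is one whose elements are not laid out sequentially in memory, such as a transposed view or a strided slice. A minimal sketch of how such operands can be constructed in PyTorch (the shapes are illustrative, not the benchmark's):

```python
import torch

base = torch.randn(1000, 200)

view = base.t()          # transpose: same storage, swapped strides
sliced = base[:, ::2]    # every other column: non-unit stride along dim 1

print(base.is_contiguous())    # True
print(view.is_contiguous())    # False
print(sliced.is_contiguous())  # False

# Element-wise kernels on such views must follow strides instead of walking
# a flat buffer, which is why they are benchmarked separately.
out = view + torch.randn(200, 1000)
```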

Platform: Platinum 8180
Operation: add
Tensor Contiguity: discontiguous
Unit: microsecond

Time cost results are shown below:

| Tensor Size | In series | In parallel | SpeedUp |
| --- | --- | --- | --- |
| 1k | 1.69 | 6.98 | 0.24X |
| 2k | 2.42 | 7.47 | 0.32X |
| 3k | 3.12 | 7.38 | 0.42X |
| 4k | 3.77 | 7.43 | 0.50X |
| 5k | 4.46 | 7.47 | 0.59X |
| 8k | 6.44 | 7.49 | 0.85X |
| 10k | 7.82 | 7.69 | 1.01X |
| 20k | 14.54 | 7.80 | 1.86X |
| 50k | 34.35 | 8.31 | 4.13X |
| 80k | 54.80 | 8.68 | 6.31X |
| 100k | 68.82 | 9.07 | 7.58X |
| 110k | 75.92 | 8.99 | 8.43X |
| 120k | 83.03 | 9.52 | 8.71X |
| 150k | 104.24 | 9.92 | 10.50X |
| 180k | 124.28 | 10.68 | 11.62X |

Conclusion: Setting the threshold to 10K is a good choice for the add operation on discontiguous tensors.

Platform: Platinum 8180
Operation: exp
Tensor Contiguity: discontiguous
Unit: microsecond

Time cost results are shown below:

| Tensor Size | In series | In parallel | SpeedUp |
| --- | --- | --- | --- |
| 1k | 10.02 | 7.27 | 1.37X |
| 2k | 19.01 | 7.83 | 2.42X |
| 3k | 27.73 | 7.48 | 3.70X |
| 4k | 36.45 | 7.66 | 4.75X |
| 5k | 45.26 | 8.13 | 5.56X |
| 8k | 71.36 | 8.70 | 8.19X |
| 10k | 88.75 | 9.15 | 9.69X |
| 20k | 176.26 | 11.32 | 15.56X |
| 50k | 439.68 | 19.07 | 23.04X |
| 80k | 700.40 | 26.99 | 25.94X |
| 100k | 876.42 | 27.61 | 31.73X |
| 110k | 983.76 | 29.79 | 33.01X |
| 120k | 1050.07 | 31.87 | 32.94X |
| 150k | 1341.23 | 37.59 | 35.67X |
| 180k | 1584.88 | 43.27 | 36.62X |

Conclusion: Setting the threshold to 1K is a good choice for the exponential (exp) operation on discontiguous tensors.

Conclusions: OpenMP parallelization also pays off for discontiguous tensor operations once the tensors are large enough; the estimated optimal thresholds are summarized in the tables in Section 1.

4.3 LSTM benchmark test

To consolidate the performance boost from the element-wise optimization, we choose a widely used RNN unit, LSTM, as the model-level benchmark reference. This is because:

  1. LSTM computations involve a considerable number of element-wise operations;
  2. PyTorch provides a scalable and flexible Python API to execute LSTM computation.

We retrieve the LSTM benchmark via the script at https://github.com/xhzhao/pytorch-rnn-benchmark, in which:

  1. The Python API torch.nn.LSTM is used as the entry point of the LSTM computation.
  2. The benchmarks are run on 24 selected input shapes used by different NLP models.
  3. The unit for the benchmarks is Sentences Per Second (SPS). [N, T, D, Z] stands for batch size, sentence length, embedding size, and hidden size. Specifically, [64, 50, 500, 500] is used by OpenNMT, and [64, 25, 4096, 4096] is used by DeepBench. A minimal sketch of a single measurement of this kind is shown after this list.
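
As an illustration only (not the benchmark script itself), a single inference measurement could look roughly as follows; the shape [64, 50, 500, 500] and the iteration count are example choices:

```python
import time
import torch

# Illustrative input shape [N, T, D, Z] = [64, 50, 500, 500] (the OpenNMT case).
N, T, D, Z = 64, 50, 500, 500

lstm = torch.nn.LSTM(input_size=D, hidden_size=Z, batch_first=True)
x = torch.randn(N, T, D)

with torch.no_grad():
    lstm(x)                       # warm-up run
    iters = 10
    start = time.perf_counter()
    for _ in range(iters):
        lstm(x)
    elapsed = time.perf_counter() - start

print(f"inference throughput: {N * iters / elapsed:.1f} SPS")
```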

Platform: Platinum-8180
Phase: Inference
Unit: SPS (Sentences Per Second)

| LSTM Input Shape | Xeon Platinum 8180 OOB | Xeon Platinum 8180 Optimized | SpeedUp |
| --- | --- | --- | --- |
| [64, 15, 500, 500] | 899.4494 | 7393.76 | 8.22X |
| [64, 20, 500, 500] | 937.1688 | 5895.53 | 6.29X |
| [64, 25, 500, 500] | 750.8159 | 4808.17 | 6.40X |
| [64, 30, 500, 500] | 625.825 | 2351.56 | 3.76X |
| [64, 35, 500, 500] | 536.1393 | 3446.69 | 6.43X |
| [64, 40, 500, 500] | 469.1356 | 2907.74 | 6.20X |
| [64, 45, 500, 500] | 417.338 | 2502.57 | 6.00X |
| [64, 50, 500, 500] | 375.6814 | 2412.96 | 6.43X |
| [16, 25, 512, 512] | 474.9601 | 1325.45 | 2.79X |
| [32, 25, 512, 512] | 606.5853 | 2394.69 | 3.95X |
| [64, 25, 512, 512] | 700.1314 | 3661.21 | 5.23X |
| [128, 25, 512, 512] | 771.5298 | 4931.85 | 6.39X |
| [16, 25, 1024, 1024] | 195.6518 | 434.34 | 2.22X |
| [32, 25, 1024, 1024] | 261.1828 | 792.48 | 3.03X |
| [64, 25, 1024, 1024] | 323.7316 | 1174.23 | 3.62X |
| [128, 25, 1024, 1024] | 458.3642 | 1793.54 | 3.91X |
| [16, 25, 2048, 2048] | 48.7229 | 71.07 | 1.46X |
| [32, 25, 2048, 2048] | 77.4796 | 131.74 | 1.70X |
| [64, 25, 2048, 2048] | 132.8328 | 245.78 | 1.85X |
| [128, 25, 2048, 2048] | 178.2548 | 429.59 | 2.41X |
| [16, 25, 4096, 4096] | 12.4995 | 16.99 | 1.36X |
| [32, 25, 4096, 4096] | 23.0582 | 28.89 | 1.25X |
| [64, 25, 4096, 4096] | 39.3725 | 53.48 | 1.36X |
| [128, 25, 4096, 4096] | 61.866 | 97.97 | 1.58X |

Platform: Platinum-8180
Phase: Training
Unit: SPS (Sentences Per Second)

| LSTM Input Shape | Xeon Platinum 8180 OOB | Xeon Platinum 8180 Optimized | SpeedUp |
| --- | --- | --- | --- |
| [64, 15, 500, 500] | 432.5038 | 740.19 | 1.71X |
| [64, 20, 500, 500] | 385.2532 | 506.49 | 1.31X |
| [64, 25, 500, 500] | 308.066 | 476.33 | 1.55X |
| [64, 30, 500, 500] | 264.2467 | 406.49 | 1.54X |
| [64, 35, 500, 500] | 217.2079 | 362.4 | 1.67X |
| [64, 40, 500, 500] | 199.5474 | 321.25 | 1.61X |
| [64, 45, 500, 500] | 187.0923 | 292.01 | 1.56X |
| [64, 50, 500, 500] | 159.5678 | 255.32 | 1.60X |
| [16, 25, 512, 512] | 168.2578 | 269.11 | 1.60X |
| [32, 25, 512, 512] | 217.3134 | 365.27 | 1.68X |
| [64, 25, 512, 512] | 273.1848 | 475.26 | 1.74X |
| [128, 25, 512, 512] | 320.5748 | 549.36 | 1.71X |
| [16, 25, 1024, 1024] | 62.4692 | 89.46 | 1.43X |
| [32, 25, 1024, 1024] | 89.6243 | 144.03 | 1.61X |
| [64, 25, 1024, 1024] | 127.414 | 199.49 | 1.57X |
| [128, 25, 1024, 1024] | 174.6576 | 255.07 | 1.46X |
| [16, 25, 2048, 2048] | 18.8309 | 25.69 | 1.36X |
| [32, 25, 2048, 2048] | 30.9957 | 47.01 | 1.52X |
| [64, 25, 2048, 2048] | 51.2821 | 75.98 | 1.48X |
| [128, 25, 2048, 2048] | 71.7206 | 113.27 | 1.58X |
| [16, 25, 4096, 4096] | 6.0788 | 7.46 | 1.23X |
| [32, 25, 4096, 4096] | 10.954 | 13.98 | 1.28X |
| [64, 25, 4096, 4096] | 18.5955 | 24.85 | 1.34X |
| [128, 25, 4096, 4096] | 28.1366 | 39.01 | 1.39X |

Platform: CPU E5-2699 v4
Phase: Inference
Unit: SPS (Sentences Per Second)

| LSTM Input Shape | Xeon E5-2699 OOB | Xeon E5-2699 Optimized | SpeedUp |
| --- | --- | --- | --- |
| [64, 15, 500, 500] | 1169.737 | 7149.82 | 6.11X |
| [64, 20, 500, 500] | 923.5499 | 6033.54 | 6.53X |
| [64, 25, 500, 500] | 739.8101 | 4846.39 | 6.55X |
| [64, 30, 500, 500] | 618.0939 | 4027.08 | 6.52X |
| [64, 35, 500, 500] | 528.3323 | 3401.53 | 6.44X |
| [64, 40, 500, 500] | 462.2187 | 2972.32 | 6.43X |
| [64, 45, 500, 500] | 410.5386 | 2625.95 | 6.40X |
| [64, 50, 500, 500] | 369.9179 | 2372.84 | 6.41X |
| [16, 25, 512, 512] | 639.4213 | 2172.63 | 3.40X |
| [32, 25, 512, 512] | 680.3161 | 3561.47 | 5.24X |
| [64, 25, 512, 512] | 727.8996 | 4864.45 | 6.68X |
| [128, 25, 512, 512] | 760.9095 | 5754.56 | 7.56X |
| [16, 25, 1024, 1024] | 320.0169 | 1381.03 | 4.32X |
| [32, 25, 1024, 1024] | 349.7738 | 1916.54 | 5.48X |
| [64, 25, 1024, 1024] | 368.3568 | 2265 | 6.15X |
| [128, 25, 1024, 1024] | 490.1187 | 2518.24 | 5.14X |
| [16, 25, 2048, 2048] | 137.989 | 383.87 | 2.78X |
| [32, 25, 2048, 2048] | 159.1569 | 590.48 | 3.71X |
| [64, 25, 2048, 2048] | 214.677 | 720.81 | 3.36X |
| [128, 25, 2048, 2048] | 210.0029 | 683.88 | 3.26X |
| [16, 25, 4096, 4096] | 42.7353 | 70.06 | 1.64X |
| [32, 25, 4096, 4096] | 66.9777 | 126.43 | 1.89X |
| [64, 25, 4096, 4096] | 82.5284 | 180.12 | 2.18X |
| [128, 25, 4096, 4096] | 83.1054 | 180.03 | 2.17X |

Platform: CPU E5-2699 v4
Phase: Training
Unit: SPS (Sentences Per Second)

| LSTM Input Shape | Xeon E5-2699 OOB | Xeon E5-2699 Optimized | SpeedUp |
| --- | --- | --- | --- |
| [64, 15, 500, 500] | 451.2899 | 627.66 | 1.39X |
| [64, 20, 500, 500] | 370.242 | 497.26 | 1.34X |
| [64, 25, 500, 500] | 298.1386 | 363.61 | 1.22X |
| [64, 30, 500, 500] | 251.8914 | 327.72 | 1.30X |
| [64, 35, 500, 500] | 225.749 | 285.99 | 1.27X |
| [64, 40, 500, 500] | 192.7014 | 271.03 | 1.41X |
| [64, 45, 500, 500] | 175.5287 | 245.5 | 1.40X |
| [64, 50, 500, 500] | 161.343 | 229.74 | 1.42X |
| [16, 25, 512, 512] | 207.6788 | 201.7 | 0.97X |
| [32, 25, 512, 512] | 250.4016 | 301.76 | 1.21X |
| [64, 25, 512, 512] | 306.2745 | 429.34 | 1.40X |
| [128, 25, 512, 512] | 345.1608 | 456.06 | 1.32X |
| [16, 25, 1024, 1024] | 66.2632 | 67.93 | 1.03X |
| [32, 25, 1024, 1024] | 37.8289 | 114.71 | 3.03X |
| [64, 25, 1024, 1024] | 76.6716 | 173.85 | 2.27X |
| [128, 25, 1024, 1024] | 141.6185 | 218 | 1.54X |
| [16, 25, 2048, 2048] | 20.5789 | 20.82 | 1.01X |
| [32, 25, 2048, 2048] | 34.5047 | 36.93 | 1.07X |
| [64, 25, 2048, 2048] | 55.1509 | 62.73 | 1.14X |
| [128, 25, 2048, 2048] | 71.7717 | 88.76 | 1.24X |
| [16, 25, 4096, 4096] | 6.8679 | 7.09 | 1.03X |
| [32, 25, 4096, 4096] | 12.5718 | 13.85 | 1.10X |
| [64, 25, 4096, 4096] | 20.1554 | 23.66 | 1.17X |
| [128, 25, 4096, 4096] | 28.1366 | 34.49 | 1.26X |

Conclusion:

According to the benchmarks collected on Intel Xeon platforms, on Platinum 8180:

  1. For LSTM inference (forward-only), the performance is boosted by 1.25X to 8.22X.
  2. For LSTM training (forward + backward), the performance is boosted by 1.23X to 1.74X.

On E5-2699 v4:

  1. For LSTM inference (forward-only), the performance is boosted by 1.64X to 7.56X.
  2. For LSTM training (forward + backward), the performance is boosted by 1.01X to 3.03X.

Test result analysis:

  1. For the inference benchmarks: since the contribution of element-wise operations varies with the input shape, it is expected that the performance boost is not uniform across input shapes.
  2. For the training benchmarks: in addition to the reason above, the backward computation gains less from the element-wise optimization, so it is expected that the performance boost on training benchmarks is less pronounced than on inference benchmarks, and likewise not uniform across input shapes.