PyTorch element-wise operation optimization benchmark
1. Abstract
This repository provides a benchmark for evaluating element-wise operation performance on CPU.
Tested CPU:
CPU Model | Sockets | Cores/Socket | Frequency |
---|---|---|---|
Intel(R) Xeon(R) CPU E5-2699 v4 | 2 | 22 | 2.20GHz |
Intel(R) Xeon(R) Platinum 8180 CPU | 2 | 28 | 2.50GHz |
Intel(R) Core(TM) i7-5960X CPU | 1 | 8 | 3.00GHz |
Tested operations:
copy | add | div | sin | exp | sum | prod |
Conclusions:
- The OpenMP threshold, which is set to 100K in official PyTorch, is too high for small and medium contiguous tensors to benefit from OpenMP parallelism.
- Operations on discontiguous tensors can be sped up significantly by Intel PyTorch.
- The optimal OpenMP threshold depends on the operation type and the CPU type.
- The optimal OpenMP threshold becomes smaller for more complex operations.
- The optimal OpenMP threshold for discontiguous tensors is usually lower than that for contiguous tensors.
Note:
OpenMP threshold -- If the size of a tensor is larger than this value, the operation runs in parallel; otherwise it runs serially.
This benchmark also gives rough estimates of the optimal OpenMP threshold for the copy, add, div, exp, sin, sum and prod operations on different CPU types.
For contiguous tensor operations:
Operation | Xeon(R) Platinum 8180 CPU | Xeon(R) CPU E5-2699 v4 | i7-5960X CPU |
---|---|---|---|
copy | 80k | 20k | 8k |
add | 80k | 20k | 8k |
div | 50k | 10k | 2k |
exp | 1k | 1k | 1k |
sin | 1k | 1k | 1k |
sum | 1k | 1k | 1k |
prod | 1k | 1k | 1k |
For discontiguous tensor operations:
Operation | Xeon(R) Platinum 8180 CPU | Xeon(R) CPU E5-2699 v4 | i7-5960X CPU |
---|---|---|---|
copy | 20k | 8k | 2k |
add | 20k | 8k | 2k |
div | 10k | 8k | 1k |
exp | 1k | 1k | 1k |
sin | 2k | 2k | 1k |
sum | 1k | 1k | 1k |
prod | 1k | 1k | 1k |
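As a minimal sketch, the two tables above can be turned into a lookup that applies the threshold rule from the note (parallel only when the tensor size exceeds the threshold). The names `THRESHOLDS` and `runs_in_parallel` are hypothetical, the first data column is read as Platinum 8180, and "1k" is assumed to mean 1000 elements.

```python
# Rough OpenMP thresholds (in elements) transcribed from the two tables above.
# Assumption: "1k" means 1000 elements.
OPS = ("copy", "add", "div", "exp", "sin", "sum", "prod")


def _row(*thresholds_in_k):
    """Map each operation name to its threshold, given values in thousands."""
    return {op: k * 1000 for op, k in zip(OPS, thresholds_in_k)}


THRESHOLDS = {
    ("Platinum 8180", "contiguous"): _row(80, 80, 50, 1, 1, 1, 1),
    ("E5-2699 v4", "contiguous"): _row(20, 20, 10, 1, 1, 1, 1),
    ("i7-5960X", "contiguous"): _row(8, 8, 2, 1, 1, 1, 1),
    ("Platinum 8180", "discontiguous"): _row(20, 20, 10, 1, 2, 1, 1),
    ("E5-2699 v4", "discontiguous"): _row(8, 8, 8, 1, 2, 1, 1),
    ("i7-5960X", "discontiguous"): _row(2, 2, 1, 1, 1, 1, 1),
}


def runs_in_parallel(cpu, contiguity, op, numel):
    """Return True if a tensor with `numel` elements exceeds the threshold."""
    return numel > THRESHOLDS[(cpu, contiguity)][op]


# A 50k-element contiguous add stays serial on Platinum 8180,
# while a 50k-element exp is parallelized.
print(runs_in_parallel("Platinum 8180", "contiguous", "add", 50_000))
print(runs_in_parallel("Platinum 8180", "contiguous", "exp", 50_000))
```

These values are rough estimates from this benchmark, so a real deployment would likely re-measure them on the target CPU.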
2. Major work
- Optimal OpenMP threshold identified to fully exploit CPU performance
The OpenMP threshold of official PyTorch is set to 100K. However, benchmarking the copy, add, div, exp and sin operations in both the contiguous and discontiguous cases on different CPU types shows that this value is too high. A rough estimate of the optimal OpenMP threshold is also proposed for those operations.
- Parallelization of discontiguous tensor operations with OpenMP
Slicing a tensor is very common in scientific computing, and slicing produces a discontiguous tensor. Official PyTorch does not currently parallelize operations on discontiguous tensors; filling this gap is our main work. Code is available at dev-omp, and upstreaming is in progress.
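To illustrate why slicing yields a discontiguous tensor, here is a small pure-Python sketch of the stride arithmetic involved. It is a simplified model, not PyTorch's implementation: the helper names are hypothetical, strides are counted in elements, and the contiguity check ignores special cases such as size-1 dimensions.

```python
def contiguous_strides(shape):
    """Row-major strides (in elements) of a densely packed tensor."""
    strides = []
    step = 1
    for dim in reversed(shape):
        strides.append(step)
        step *= dim
    return tuple(reversed(strides))


def is_contiguous(shape, strides):
    """Simplified check: strides must equal the dense row-major strides."""
    return tuple(strides) == contiguous_strides(shape)


def slice_every_other_column(shape, strides):
    """Shape and strides of the view x[:, ::2] of a 2-D tensor."""
    rows, cols = shape
    return (rows, (cols + 1) // 2), (strides[0], strides[1] * 2)


base_shape = (3, 4)
base_strides = contiguous_strides(base_shape)      # (4, 1)
view_shape, view_strides = slice_every_other_column(base_shape, base_strides)

print(is_contiguous(base_shape, base_strides))     # dense tensor
print(view_shape, view_strides)                    # the sliced view skips elements
print(is_contiguous(view_shape, view_strides))     # no longer dense
```

In PyTorch itself the same property can be inspected with `x[:, ::2].is_contiguous()` on a 2-D tensor `x`.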
3. Installation and test
3.1 Installation
Official Pytorch
Please refer to official link
Intel Pytorch
Download the Intel PyTorch source code.
git clone --recursive -b dev-omp2 https://github.com/intel/pytorch.git
Before installing, you should set the CMAKE_PREFIX_PATH.
export CMAKE_PREFIX_PATH="$(dirname $(which conda))/../" # [anaconda root directory]
Install Intel PyTorch.
python setup.py install
3.2 Test
python benchmark.py <CONTIGUITY> <OPERATION> [OUTPUT FILENAME]
Positional arguments:
CONTIGUITY
-- operands' contiguity, contiguous/discontiguous
OPERATION
-- operation, copy/add/div/sin/exp/sum/prod
Optional arguments:
OUTPUT FILENAME
-- output filename, output.log by default
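The argument handling inside benchmark.py is not shown here. Purely as an illustration, a hypothetical parser matching the usage line above (two required positionals and an optional output filename defaulting to output.log) might look like this; `build_parser` and the attribute names are assumptions, not the script's actual code.

```python
import argparse


def build_parser():
    """Hypothetical parser mirroring the documented usage of benchmark.py."""
    parser = argparse.ArgumentParser(prog="benchmark.py")
    parser.add_argument(
        "contiguity",
        choices=["contiguous", "discontiguous"],
        help="operands' contiguity",
    )
    parser.add_argument(
        "operation",
        choices=["copy", "add", "div", "sin", "exp", "sum", "prod"],
        help="operation to benchmark",
    )
    # Optional positional: falls back to output.log when omitted.
    parser.add_argument("output", nargs="?", default="output.log",
                        help="output filename")
    return parser


args = build_parser().parse_args(["discontiguous", "exp"])
print(args.contiguity, args.operation, args.output)
```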
4. The benchmark result
4.1 Contiguous Tensor Operation OpenMP Threshold Tuning
The add and exp operations on contiguous tensors with sizes ranging from 1K to 100K are listed here as test cases. We compiled two versions of official PyTorch with different OpenMP thresholds. In one version the threshold is set to 100K so that all test cases run serially; in the other it is set to 800 so that all test cases run in parallel.
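The measurement loop itself is not described in this section. A generic sketch of how a per-call time in microseconds could be obtained is shown below; `average_microseconds` is a hypothetical helper, the warm-up and iteration counts are arbitrary, and the list-based workload is only a stand-in for a PyTorch call such as `torch.add(a, b)`.

```python
import time


def average_microseconds(fn, warmup=10, iterations=1000):
    """Average wall-clock time of one call to `fn`, in microseconds."""
    for _ in range(warmup):          # warm caches before timing
        fn()
    start = time.perf_counter()
    for _ in range(iterations):
        fn()
    elapsed = time.perf_counter() - start
    return elapsed / iterations * 1e6


# Stand-in workload: element-wise add of two 1k-element Python lists.
a = list(range(1000))
b = list(range(1000))


def add_lists():
    return [x + y for x, y in zip(a, b)]


print(f"{average_microseconds(add_lists):.2f} us per call")
```

Running the same harness once against a serial build and once against a parallel build would give the two timing columns of a table like the ones below.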
Platform: Platinum 8180
Operation: add
Tensor Contiguity: contiguous
Unit: microseconds
Time cost results:
Tensor Size | In series | In parallel | SpeedUp |
---|---|---|---|
1k | 1.04 | 5.15 | 0.20X |
2k | 1.23 | 5.47 | 0.22X |
3k | 1.33 | 5.34 | 0.24X |
4k | 1.47 | 5.41 | 0.27X |
5k | 1.48 | 5.40 | 0.27X |
8k | 1.81 | 5.55 | 0.32X |
10k | 1.98 | 5.66 | 0.35X |
20k | 2.74 | 6.74 | 0.40X |
50k | 5.12 | 6.59 | 0.77X |
80k | 14.79 | 6.59 | 2.24X |
100k | 21.97 | 6.70 | 3.27X |
Conclusion: A threshold of 80K works well for the add operation on contiguous tensors.
Platform: Platinum 8180
Operation: exp
Tensor Contiguity: contiguous
Unit: microseconds
Time cost results:
Tensor Size | In series | In parallel | SpeedUp |
---|---|---|---|
1k | 9.48 | 5.66 | 1.67X |
2k | 17.00 | 6.35 | 2.67X |
3k | 24.82 | 6.03 | 4.11X |
4k | 32.52 | 6.28 | 5.17X |
5k | 40.33 | 6.27 | 6.42X |
8k | 63.58 | 7.04 | 9.02X |
10k | 79.13 | 7.61 | 10.38X |
20k | 156.78 | 9.11 | 17.20X |
50k | 387.85 | 15.07 | 25.73X |
80k | 623.34 | 20.23 | 30.80X |
100k | 779.95 | 23.57 | 33.08X |
Conclusion: A threshold of 1K works well for the exp operation on contiguous tensors.
The results above show that:
- Different operations have their own optimal OpenMP threshold, and 100K is not suitable for any of them.
- The optimal OpenMP threshold becomes smaller for more complex operations.
We do not list the detailed data for the div, sin, sum and prod operations, but provide a rough estimate of the optimal OpenMP threshold for each of them.
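One way to read a threshold off tables like the two above is to look for the smallest measured size from which the parallel build is faster at that size and at every larger size. The sketch below applies that rule to the Platinum 8180 numbers; `crossover_size` is a hypothetical helper, and "1k" is assumed to mean 1000 elements.

```python
def crossover_size(rows):
    """rows: (size, serial_us, parallel_us) tuples sorted by size.

    Returns the smallest measured size from which the parallel run is
    faster for that size and all larger ones, or None if it never is.
    """
    best = None
    for size, serial_us, parallel_us in reversed(rows):
        if parallel_us < serial_us:
            best = size
        else:
            break
    return best


# Contiguous add on Platinum 8180 (microseconds), from the table above.
ADD_8180 = [
    (1000, 1.04, 5.15), (2000, 1.23, 5.47), (3000, 1.33, 5.34),
    (4000, 1.47, 5.41), (5000, 1.48, 5.40), (8000, 1.81, 5.55),
    (10000, 1.98, 5.66), (20000, 2.74, 6.74), (50000, 5.12, 6.59),
    (80000, 14.79, 6.59), (100000, 21.97, 6.70),
]

# Contiguous exp on Platinum 8180 (microseconds), from the table above.
EXP_8180 = [
    (1000, 9.48, 5.66), (2000, 17.00, 6.35), (3000, 24.82, 6.03),
    (4000, 32.52, 6.28), (5000, 40.33, 6.27), (8000, 63.58, 7.04),
    (10000, 79.13, 7.61), (20000, 156.78, 9.11), (50000, 387.85, 15.07),
    (80000, 623.34, 20.23), (100000, 779.95, 23.57),
]

print(crossover_size(ADD_8180))  # 80000
print(crossover_size(EXP_8180))  # 1000
```

Because only a few sizes were measured, the result is an upper bound on the true crossover: for add it lies somewhere between 50k and 80k.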
4.2 Discontiguous tensor operation parallelization
The performance of the add and exp operations on discontiguous tensors with sizes ranging from 1k to 180k is listed here. Official PyTorch does not parallelize operations on discontiguous tensors with OpenMP, but the Intel version does. To show that OpenMP also benefits discontiguous tensor operations and to find an optimal OpenMP threshold, we compiled two versions of PyTorch. One is official PyTorch. The other is Intel PyTorch with the OpenMP threshold set to 800 so that all test cases run in parallel.
Platform: Platinum 8180
Operation: add
Tensor Contiguity: discontiguous
Unit: microseconds
Time cost results:
Tensor Size | In series | In parallel | SpeedUp |
---|---|---|---|
1k | 1.69 | 6.98 | 0.24X |
2k | 2.42 | 7.47 | 0.32X |
3k | 3.12 | 7.38 | 0.42X |
4k | 3.77 | 7.43 | 0.50X |
5k | 4.46 | 7.47 | 0.59X |
8k | 6.44 | 7.49 | 0.85X |
10k | 7.82 | 7.69 | 1.01X |
20k | 14.54 | 7.80 | 1.86X |
50k | 34.35 | 8.31 | 4.13X |
80k | 54.80 | 8.68 | 6.31X |
100k | 68.82 | 9.07 | 7.58X |
110k | 75.92 | 8.99 | 8.43X |
120k | 83.03 | 9.52 | 8.71X |
150k | 104.24 | 9.92 | 10.50X |
180k | 124.28 | 10.68 | 11.62X |
Conclusion: A threshold of 10K works well for the add operation on discontiguous tensors.
Platform: Platinum 8180
Operation: exp
Tensor Contiguity: discontiguous
Unit: microseconds
Time cost results:
Tensor Size | In series | In parallel | SpeedUp |
---|---|---|---|
1k | 10.02 | 7.27 | 1.37X |
2k | 19.01 | 7.83 | 2.42X |
3k | 27.73 | 7.48 | 3.70X |
4k | 36.45 | 7.66 | 4.75X |
5k | 45.26 | 8.13 | 5.56X |
8k | 71.36 | 8.70 | 8.19X |
10k | 88.75 | 9.15 | 9.69X |
20k | 176.26 | 11.32 | 15.56X |
50k | 439.68 | 19.07 | 23.04X |
80k | 700.40 | 26.99 | 25.94X |
100k | 876.42 | 27.61 | 31.73X |
110k | 983.76 | 29.79 | 33.01X |
120k | 1050.07 | 31.87 | 32.94X |
150k | 1341.23 | 37.59 | 35.67X |
180k | 1584.88 | 43.27 | 36.62X |
Conclusion: A threshold of 1K works well for the exp operation on discontiguous tensors.
Conclusions:
- Discontiguous tensor operations can be sped up considerably with OpenMP parallelization.
- The optimal OpenMP threshold for discontiguous tensors is usually lower than that for contiguous tensors, because the same operation takes longer on a discontiguous tensor than on a contiguous one.
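One plausible contributor to the extra cost on discontiguous tensors (an assumption here, not something this benchmark measures) is the index arithmetic needed to locate each element. The sketch below shows that mapping in pure Python; `memory_offset` is a hypothetical helper and strides are counted in elements.

```python
def memory_offset(flat_index, shape, strides):
    """Map a logical row-major index to a storage offset using strides."""
    offset = 0
    for dim, stride in zip(reversed(shape), reversed(strides)):
        flat_index, coord = divmod(flat_index, dim)
        offset += coord * stride
    return offset


shape = (3, 2)

# Dense 3x2 tensor: logical index and storage offset coincide.
dense = [memory_offset(i, shape, (2, 1)) for i in range(6)]

# View x[:, :2] of a dense 3x4 tensor: each row starts 4 elements apart,
# so the offsets jump over the two columns that were sliced away.
sliced = [memory_offset(i, shape, (4, 1)) for i in range(6)]

print(dense)   # [0, 1, 2, 3, 4, 5]
print(sliced)  # [0, 1, 4, 5, 8, 9]
```

For a contiguous tensor a kernel can simply walk the storage linearly, whereas a strided view needs this kind of per-element (or per-row) offset computation.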
4.3 LSTM benchmark test
To confirm the performance gain from the element-wise optimization, we chose a widely used RNN unit, LSTM, as the model-level benchmark reference, because:
- LSTM computations involve a considerable number of element-wise operations;
- PyTorch provides a scalable and flexible Python API to execute LSTM computation.
We run the LSTM benchmark with the script at https://github.com/xhzhao/pytorch-rnn-benchmark , in which:
- The Python API torch.nn.LSTM is used as the entry point for LSTM computation.
- The benchmarks run on 24 selected input shapes used by different NLP models.
- The benchmark unit is Sentences Per Second (SPS). [N, T, D, Z] stands for batch size, sentence length, embedding size and hidden size. Specifically, [64, 50, 500, 500] is used by OpenNMT and [64, 25, 4096, 4096] is used by DeepBench.
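As a rough sketch of how the numbers in the tables below relate, SPS can be thought of as sentences processed divided by elapsed time, and the speed-up column as the ratio of optimized to out-of-box (OOB) throughput. Both helper names are hypothetical, and the exact formula used by the benchmark script is an assumption.

```python
def sentences_per_second(batch_size, iterations, elapsed_seconds):
    """Throughput: total sentences processed divided by wall-clock time."""
    return batch_size * iterations / elapsed_seconds


def speedup(optimized_sps, baseline_sps):
    """Ratio of optimized throughput to baseline (OOB) throughput."""
    return optimized_sps / baseline_sps


# Illustrative timing (not a measured value): 100 batches of 64 in 8 s.
print(sentences_per_second(64, 100, 8.0))         # 800.0

# First inference row on Platinum 8180: OOB 899.4494, optimized 7393.76.
print(round(speedup(7393.76, 899.4494), 2))       # 8.22
```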
Platform: Platinum-8180
Phase: Inference
Unit: SPS (Sentences Per Second)
LSTM Input Shape | Xeon Platinum 8180 OOB | Xeon Platinum 8180 Optimized | SpeedUp |
---|---|---|---|
[64, 15, 500, 500] | 899.4494 | 7393.76 | 8.22X |
[64, 20, 500, 500] | 937.1688 | 5895.53 | 6.29X |
[64, 25, 500, 500] | 750.8159 | 4808.17 | 6.40X |
[64, 30, 500, 500] | 625.825 | 2351.56 | 3.76X |
[64, 35, 500, 500] | 536.1393 | 3446.69 | 6.43X |
[64, 40, 500, 500] | 469.1356 | 2907.74 | 6.20X |
[64, 45, 500, 500] | 417.338 | 2502.57 | 6.00X |
[64, 50, 500, 500] | 375.6814 | 2412.96 | 6.43X |
[16, 25, 512, 512] | 474.9601 | 1325.45 | 2.79X |
[32, 25, 512, 512] | 606.5853 | 2394.69 | 3.95X |
[64, 25, 512, 512] | 700.1314 | 3661.21 | 5.23X |
[128, 25, 512, 512] | 771.5298 | 4931.85 | 6.39X |
[16, 25, 1024, 1024] | 195.6518 | 434.34 | 2.22X |
[32, 25, 1024, 1024] | 261.1828 | 792.48 | 3.03X |
[64, 25, 1024, 1024] | 323.7316 | 1174.23 | 3.62X |
[128, 25, 1024, 1024] | 458.3642 | 1793.54 | 3.91X |
[16, 25, 2048, 2048] | 48.7229 | 71.07 | 1.46X |
[32, 25, 2048, 2048] | 77.4796 | 131.74 | 1.70X |
[64, 25, 2048, 2048] | 132.8328 | 245.78 | 1.85X |
[128, 25, 2048, 2048] | 178.2548 | 429.59 | 2.41X |
[16, 25, 4096, 4096] | 12.4995 | 16.99 | 1.36X |
[32, 25, 4096, 4096] | 23.0582 | 28.89 | 1.25X |
[64, 25, 4096, 4096] | 39.3725 | 53.48 | 1.36X |
[128, 25, 4096, 4096] | 61.866 | 97.97 | 1.58X |
Platform: Platinum-8180
Phase: Training
Unit: SPS (Sentences Per Second)
LSTM Input Shape | Xeon Platinum 8180 OOB | Xeon Platinum 8180 Optimized | Speed-up |
---|---|---|---|
[64, 15, 500, 500] | 432.5038 | 740.19 | 1.71X |
[64, 20, 500, 500] | 385.2532 | 506.49 | 1.31X |
[64, 25, 500, 500] | 308.066 | 476.33 | 1.55X |
[64, 30, 500, 500] | 264.2467 | 406.49 | 1.54X |
[64, 35, 500, 500] | 217.2079 | 362.4 | 1.67X |
[64, 40, 500, 500] | 199.5474 | 321.25 | 1.61X |
[64, 45, 500, 500] | 187.0923 | 292.01 | 1.56X |
[64, 50, 500, 500] | 159.5678 | 255.32 | 1.60X |
[16, 25, 512, 512] | 168.2578 | 269.11 | 1.60X |
[32, 25, 512, 512] | 217.3134 | 365.27 | 1.68X |
[64, 25, 512, 512] | 273.1848 | 475.26 | 1.74X |
[128, 25, 512, 512] | 320.5748 | 549.36 | 1.71X |
[16, 25, 1024, 1024] | 62.4692 | 89.46 | 1.43X |
[32, 25, 1024, 1024] | 89.6243 | 144.03 | 1.61X |
[64, 25, 1024, 1024] | 127.414 | 199.49 | 1.57X |
[128, 25, 1024, 1024] | 174.6576 | 255.07 | 1.46X |
[16, 25, 2048, 2048] | 18.8309 | 25.69 | 1.36X |
[32, 25, 2048, 2048] | 30.9957 | 47.01 | 1.52X |
[64, 25, 2048, 2048] | 51.2821 | 75.98 | 1.48X |
[128, 25, 2048, 2048] | 71.7206 | 113.27 | 1.58X |
[16, 25, 4096, 4096] | 6.0788 | 7.46 | 1.23X |
[32, 25, 4096, 4096] | 10.954 | 13.98 | 1.28X |
[64, 25, 4096, 4096] | 18.5955 | 24.85 | 1.34X |
[128, 25, 4096, 4096] | 28.1366 | 39.01 | 1.39X |
Platform: CPU E5-2699 v4
Phase: Inference
Unit: SPS (Sentences Per Second)
LSTM Input Shape | Xeon E5-2699 OOB | Xeon E5-2699 Optimized | Speed-up |
---|---|---|---|
[64, 15, 500, 500] | 1169.737 | 7149.82 | 6.11X |
[64, 20, 500, 500] | 923.5499 | 6033.54 | 6.53X |
[64, 25, 500, 500] | 739.8101 | 4846.39 | 6.55X |
[64, 30, 500, 500] | 618.0939 | 4027.08 | 6.52X |
[64, 35, 500, 500] | 528.3323 | 3401.53 | 6.44X |
[64, 40, 500, 500] | 462.2187 | 2972.32 | 6.43X |
[64, 45, 500, 500] | 410.5386 | 2625.95 | 6.40X |
[64, 50, 500, 500] | 369.9179 | 2372.84 | 6.41X |
[16, 25, 512, 512] | 639.4213 | 2172.63 | 3.40X |
[32, 25, 512, 512] | 680.3161 | 3561.47 | 5.24X |
[64, 25, 512, 512] | 727.8996 | 4864.45 | 6.68X |
[128, 25, 512, 512] | 760.9095 | 5754.56 | 7.56X |
[16, 25, 1024, 1024] | 320.0169 | 1381.03 | 4.32X |
[32, 25, 1024, 1024] | 349.7738 | 1916.54 | 5.48X |
[64, 25, 1024, 1024] | 368.3568 | 2265 | 6.15X |
[128, 25, 1024, 1024] | 490.1187 | 2518.24 | 5.14X |
[16, 25, 2048, 2048] | 137.989 | 383.87 | 2.78X |
[32, 25, 2048, 2048] | 159.1569 | 590.48 | 3.71X |
[64, 25, 2048, 2048] | 214.677 | 720.81 | 3.36X |
[128, 25, 2048, 2048] | 210.0029 | 683.88 | 3.26X |
[16, 25, 4096, 4096] | 42.7353 | 70.06 | 1.64X |
[32, 25, 4096, 4096] | 66.9777 | 126.43 | 1.89X |
[64, 25, 4096, 4096] | 82.5284 | 180.12 | 2.18X |
[128, 25, 4096, 4096] | 83.1054 | 180.03 | 2.17X |
Platform: CPU E5-2699 v4
Phase: Training
Unit: SPS (Sentences Per Second)
LSTM Input Shape | Xeon E5-2699 OOB | Xeon E5-2699 Optimized | Speed-up |
---|---|---|---|
[64, 15, 500, 500] | 451.2899 | 627.66 | 1.39X |
[64, 20, 500, 500] | 370.242 | 497.26 | 1.34X |
[64, 25, 500, 500] | 298.1386 | 363.61 | 1.22X |
[64, 30, 500, 500] | 251.8914 | 327.72 | 1.30X |
[64, 35, 500, 500] | 225.749 | 285.99 | 1.27X |
[64, 40, 500, 500] | 192.7014 | 271.03 | 1.41X |
[64, 45, 500, 500] | 175.5287 | 245.5 | 1.40X |
[64, 50, 500, 500] | 161.343 | 229.74 | 1.42X |
[16, 25, 512, 512] | 207.6788 | 201.7 | 0.97X |
[32, 25, 512, 512] | 250.4016 | 301.76 | 1.21X |
[64, 25, 512, 512] | 306.2745 | 429.34 | 1.40X |
[128, 25, 512, 512] | 345.1608 | 456.06 | 1.32X |
[16, 25, 1024, 1024] | 66.2632 | 67.93 | 1.03X |
[32, 25, 1024, 1024] | 37.8289 | 114.71 | 3.03X |
[64, 25, 1024, 1024] | 76.6716 | 173.85 | 2.27X |
[128, 25, 1024, 1024] | 141.6185 | 218 | 1.54X |
[16, 25, 2048, 2048] | 20.5789 | 20.82 | 1.01X |
[32, 25, 2048, 2048] | 34.5047 | 36.93 | 1.07X |
[64, 25, 2048, 2048] | 55.1509 | 62.73 | 1.14X |
[128, 25, 2048, 2048] | 71.7717 | 88.76 | 1.24X |
[16, 25, 4096, 4096] | 6.8679 | 7.09 | 1.03X |
[32, 25, 4096, 4096] | 12.5718 | 13.85 | 1.10X |
[64, 25, 4096, 4096] | 20.1554 | 23.66 | 1.17X |
[128, 25, 4096, 4096] | 27.4074 | 34.49 | 1.26X |
Conclusion:
According to the benchmarks collected on Intel Xeon platforms:
On Platinum 8180:
- For LSTM inference (forward only), performance is boosted by 1.25X to 8.22X.
- For LSTM training (forward + backward), performance is boosted by 1.23X to 1.74X.
On E5-2699 v4:
- For LSTM inference (forward only), performance is boosted by 1.64X to 7.56X.
- For LSTM training (forward + backward), the speed-up ranges from 0.97X to 3.03X.
Analysis of the test results:
- For inference benchmarks: The share of element-wise operations in the total computation varies with the input shape, so the speed-up is expected to vary across input shapes.
- For training benchmarks: The same reasoning applies as for inference. In addition, the backward computation gains less from the element-wise optimization, so the speed-up on training benchmarks is expected to be smaller than on inference benchmarks and to vary across input shapes.