
PyTorch element-wise operation optimization benchmark

1. Abstract

This repository provides a benchmark for evaluating the performance of element-wise operations on CPU.

Tested CPUs:

CPU Model	Sockets	Cores/Socket	Frequency
Intel(R) Xeon(R) CPU E5-2699 v4	2	22	2.20GHz
Intel(R) Xeon(R) Platinum 8180 CPU	2	28	2.50GHz
Intel(R) Core(TM) i7-5960X CPU	1	8	3.00GHz

Tested operations:


copy	add	div	sin	exp	sum	prod

Conclusions:

The OpenMP threshold, which is set to 100k in the official version, is too high for small and medium-sized contiguous tensors to benefit from OpenMP parallelism.
Operations on discontiguous tensors can be sped up significantly by Intel PyTorch.
The optimal OpenMP threshold depends on the operation type and the CPU type.
- The optimal OpenMP threshold becomes smaller for more complex operations.
- The optimal OpenMP threshold for discontiguous tensors is usually lower than that for contiguous tensors.

Annotation:
OpenMP threshold: if the size of a tensor is larger than this value, the operation runs in parallel; otherwise it runs serially.

This benchmark also gives a rough estimate of the optimal OpenMP threshold for the copy, add, div, exp, sin, sum and prod operations on different types of CPU.

For contiguous tensor operations:

	Xeon(R) Platinum 8180 CPU	Xeon(R) CPU E5-2699 v4	i7-5960X CPU
copy	80k	20k	8k
add	80k	20k	8k
div	50k	10k	2k
exp	1k	1k	1k
sin	1k	1k	1k
sum	1k	1k	1k
prod	1k	1k	1k

For discontiguous tensor operations:

	Xeon(R) Platinum 8180 CPU	Xeon(R) CPU E5-2699 v4	i7-5960X CPU
copy	20k	8k	2k
add	20k	8k	2k
div	10k	8k	1k
exp	1k	1k	1k
sin	2k	2k	1k
sum	1k	1k	1k
prod	1k	1k	1k
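
The two tables above can be read as a lookup from (contiguity, operation, CPU) to a threshold, combined with the rule from the annotation. The sketch below is only an illustration of that reading: the dictionary layout and the name `runs_in_parallel` are hypothetical and not part of PyTorch, and it assumes that "1k" means 1000 elements.

```python
# Rough optimal OpenMP thresholds (in elements), copied from the tables above.
# Assumption: "1k" means 1000 elements. Names here are illustrative only.
K = 1000
CPUS = ("Platinum 8180", "E5-2699 v4", "i7-5960X")

THRESHOLDS = {
    "contiguous": {
        "copy": (80 * K, 20 * K, 8 * K),
        "add": (80 * K, 20 * K, 8 * K),
        "div": (50 * K, 10 * K, 2 * K),
        "exp": (1 * K, 1 * K, 1 * K),
        "sin": (1 * K, 1 * K, 1 * K),
        "sum": (1 * K, 1 * K, 1 * K),
        "prod": (1 * K, 1 * K, 1 * K),
    },
    "discontiguous": {
        "copy": (20 * K, 8 * K, 2 * K),
        "add": (20 * K, 8 * K, 2 * K),
        "div": (10 * K, 8 * K, 1 * K),
        "exp": (1 * K, 1 * K, 1 * K),
        "sin": (2 * K, 2 * K, 1 * K),
        "sum": (1 * K, 1 * K, 1 * K),
        "prod": (1 * K, 1 * K, 1 * K),
    },
}


def runs_in_parallel(numel, op, cpu, contiguity="contiguous"):
    """Apply the annotation's rule: parallel only if numel exceeds the threshold."""
    threshold = THRESHOLDS[contiguity][op][CPUS.index(cpu)]
    return numel > threshold


# A 50k-element contiguous add on the Platinum 8180 stays serial,
# while a 5k-element discontiguous exp on the i7-5960X runs in parallel.
print(runs_in_parallel(50 * K, "add", "Platinum 8180"))
print(runs_in_parallel(5 * K, "exp", "i7-5960X", "discontiguous"))
```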

2. Major work

Identifying the optimal OpenMP threshold to fully exploit CPU performance
The OpenMP threshold of official PyTorch is set to 100K. However, benchmarking the copy, add, div, exp and sin operations in both contiguous and discontiguous cases on different CPU types shows that this value is too high. A rough estimate of the optimal OpenMP threshold is also proposed for those operations.
Parallelizing discontiguous tensor operations with OpenMP
Slicing is very common in scientific computing, and slicing a tensor generates a discontiguous tensor. Official PyTorch does not currently parallelize operations on discontiguous tensors. Our main work is to fill this gap. Code is available at dev-omp, and upstreaming is in progress.
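
To illustrate why slicing produces a discontiguous tensor, here is a small pure-Python sketch that tracks only shapes and strides. The helper names are hypothetical, and the contiguity check is a simplification of what PyTorch does (for example, it does not special-case dimensions of size 1).

```python
def row_major_strides(shape):
    """Strides (in elements) of a densely packed, row-major tensor."""
    strides = []
    step = 1
    for dim in reversed(shape):
        strides.append(step)
        step *= dim
    return tuple(reversed(strides))


def slice_view(shape, strides, axis, step):
    """Shape and strides of the view that keeps every `step`-th index along `axis`."""
    new_shape = list(shape)
    new_strides = list(strides)
    new_shape[axis] = (shape[axis] + step - 1) // step
    new_strides[axis] = strides[axis] * step
    return tuple(new_shape), tuple(new_strides)


def is_contiguous(shape, strides):
    """Simplified check: the strides must match a dense row-major layout."""
    return tuple(strides) == row_major_strides(shape)


# A 4x6 tensor is stored densely; taking every second column (like x[:, ::2])
# leaves gaps between consecutive elements of each row.
shape = (4, 6)
strides = row_major_strides(shape)
view_shape, view_strides = slice_view(shape, strides, axis=1, step=2)
print(is_contiguous(shape, strides))            # dense layout
print(view_shape, view_strides)                 # sliced view
print(is_contiguous(view_shape, view_strides))  # gaps between elements
```

Because consecutive elements of the sliced view are no longer adjacent in memory, a kernel has to compute an offset per element from the strides, which is why the same operation costs more on a discontiguous tensor.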

3. Installation and test

3.1 Installation

Official Pytorch

Please refer to official link

Intel Pytorch

Download the Intel PyTorch source code.

git clone --recursive -b dev-omp2 https://github.com/intel/pytorch.git

Before installing, you should set the CMAKE_PREFIX_PATH.

export CMAKE_PREFIX_PATH="$(dirname $(which conda))/../" # [anaconda root directory]

Install Intel PyTorch.

python setup.py install

3.2 Test

python benchmark.py <CONTIGUITY> <OPERATION> [OUTPUT FILENAME]

Positional arguments:
CONTIGUITY: operands' contiguity, contiguous/discontiguous
OPERATION: operation, copy/add/div/sin/exp/sum/prod

Optional arguments:
OUTPUT FILENAME: output filename, output.log by default

4. The benchmark result

4.1 Contiguous Tensor Operation OpenMP Threshold Tuning

The add and exp operations on contiguous tensors whose sizes range from 1K to 100K are listed here as test cases. We compiled two versions of official PyTorch with two different OpenMP thresholds. In one version the threshold is set to 100K so that all test cases run in series; in the other it is set to 800 so that all test cases run in parallel.
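
For readers who want to reproduce this kind of measurement, the following is a minimal timing sketch. It is not taken from benchmark.py; the function name, the warm-up count and the repeat count are assumptions chosen for illustration. It reports the mean wall-clock time of one call in microseconds, the unit used in the tables of this section.

```python
import time


def mean_time_us(fn, repeats=1000, warmup=10):
    """Average wall-clock time of one call to `fn`, in microseconds."""
    # Warm-up calls keep one-time costs (caches, lazy initialization)
    # out of the measured interval.
    for _ in range(warmup):
        fn()
    start = time.perf_counter()
    for _ in range(repeats):
        fn()
    elapsed = time.perf_counter() - start
    return elapsed / repeats * 1e6


# Stand-in workload; any zero-argument callable can be timed the same way.
print(f"{mean_time_us(lambda: sum(range(1000))):.2f} us per call")
```

With PyTorch installed, the same helper could be given a callable that wraps a tensor operation, and run once against each of the two builds to obtain the "In series" and "In parallel" columns.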

Platform: Platinum 8180
Operation: add
Tensor Contiguity: contiguous
Unit: microsecond

Time cost result is below:

Tensor Size	In series	In parallel	SpeedUp
1k	1.04	5.15	0.20X
2k	1.23	5.47	0.22X
3k	1.33	5.34	0.24X
4k	1.47	5.41	0.27X
5k	1.48	5.40	0.27X
8k	1.81	5.55	0.32X
10k	1.98	5.66	0.35X
20k	2.74	6.74	0.40X
50k	5.12	6.59	0.77X
80k	14.79	6.59	2.24X
100k	21.97	6.70	3.27X

Conclusion: Setting the threshold to 80K is good for add operation of contiguous tensors.
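
The conclusion above follows from locating the smallest tested size at which the parallel run is faster than the serial one. A short sketch of that calculation, using the numbers from the table (the function names are illustrative):

```python
# (tensor size, serial time in us, parallel time in us), copied from the table above.
ADD_CONTIGUOUS_8180 = [
    ("1k", 1.04, 5.15),
    ("2k", 1.23, 5.47),
    ("3k", 1.33, 5.34),
    ("4k", 1.47, 5.41),
    ("5k", 1.48, 5.40),
    ("8k", 1.81, 5.55),
    ("10k", 1.98, 5.66),
    ("20k", 2.74, 6.74),
    ("50k", 5.12, 6.59),
    ("80k", 14.79, 6.59),
    ("100k", 21.97, 6.70),
]


def speedups(rows):
    """Speedup of the parallel run over the serial run for each size."""
    return [(size, serial / parallel) for size, serial, parallel in rows]


def first_size_where_parallel_wins(rows):
    """Smallest tested size whose parallel time is below its serial time."""
    for size, serial, parallel in rows:
        if serial > parallel:
            return size
    return None


for size, ratio in speedups(ADD_CONTIGUOUS_8180):
    print(f"{size}\t{ratio:.2f}X")
print("crossover:", first_size_where_parallel_wins(ADD_CONTIGUOUS_8180))
```

Since only a handful of sizes were tested, the true crossover lies somewhere between the last size where serial wins (50k) and the first where parallel wins (80k).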

Platform: Platinum 8180
Operation: exp
Tensor Contiguity: contiguous
Unit: microsecond

Time cost result is below:

Tensor Size	In series	In parallel	SpeedUp
1k	9.48	5.66	1.67X
2k	17.00	6.35	2.67X
3k	24.82	6.03	4.11X
4k	32.52	6.28	5.17X
5k	40.33	6.27	6.42X
8k	63.58	7.04	9.02X
10k	79.13	7.61	10.38X
20k	156.78	9.11	17.20X
50k	387.85	15.07	25.73X
80k	623.34	20.23	30.80X
100k	779.95	23.57	33.08X

Conclusion: Setting the threshold to 1K is good for exponential operation of contiguous tensors.

From the above results, it is clear that:

Different operations have their own optimal OpenMP threshold, but 100K is not suitable.
The optimal OpenMP threshold becomes smaller for more complex operations.

We do not list all the detailed data for the div, sin, sum and prod operations, but provide a rough estimate of the optimal OpenMP threshold for each of them.

4.2 Discontiguous tensor operation parallelization

The performance of the add and exp operations on discontiguous tensors whose sizes range from 1k to 180k is listed here. Official PyTorch does not parallelize operations on discontiguous tensors with OpenMP, but the Intel version does. To show that OpenMP also helps discontiguous tensor operations, and to find an optimal OpenMP threshold, we compiled two versions of PyTorch. One is official PyTorch. The other is Intel PyTorch, with its OpenMP threshold set to 800 so that all test cases run in parallel.

Platform: Platinum 8180
Operation: add
Tensor Contiguity: discontiguous
Unit: microsecond

Time cost result is below:

Tensor Size	In series	In parallel	SpeedUp
1k	1.69	6.98	0.24X
2k	2.42	7.47	0.32X
3k	3.12	7.38	0.42X
4k	3.77	7.43	0.50X
5k	4.46	7.47	0.59X
8k	6.44	7.49	0.85X
10k	7.82	7.69	1.01X
20k	14.54	7.80	1.86X
50k	34.35	8.31	4.13X
80k	54.80	8.68	6.31X
100k	68.82	9.07	7.58X
110k	75.92	8.99	8.43X
120k	83.03	9.52	8.71X
150k	104.24	9.92	10.50X
180k	124.28	10.68	11.62X

Conclusion: Setting the threshold to 10K is good for add operation of discontiguous tensors.

Platform: Platinum 8180
Operation: exp
Tensor Contiguity: discontiguous
Unit: microsecond

Time cost result is below:

Tensor Size	In series	In parallel	SpeedUp
1k	10.02	7.27	1.37X
2k	19.01	7.83	2.42X
3k	27.73	7.48	3.70X
4k	36.45	7.66	4.75X
5k	45.26	8.13	5.56X
8k	71.36	8.70	8.19X
10k	88.75	9.15	9.69X
20k	176.26	11.32	15.56X
50k	439.68	19.07	23.04X
80k	700.40	26.99	25.94X
100k	876.42	27.61	31.73X
110k	983.76	29.79	33.01X
120k	1050.07	31.87	32.94X
150k	1341.23	37.59	35.67X
180k	1584.88	43.27	36.62X

Conclusion: Setting the threshold to 1K is good for exponential operation of discontiguous tensors.

Conclusions:

Operations on discontiguous tensors can be sped up considerably by OpenMP parallelization.
The OpenMP threshold for discontiguous tensors is usually lower than that for contiguous tensors, because the same operation takes longer on a discontiguous tensor than on a contiguous one.

4.3 LSTM benchmark test

To confirm the performance gain from the element-wise optimization at the model level, we chose a widely used RNN unit, LSTM, as the model-level benchmark. This is because:

LSTM computations involve a considerable number of element-wise operations;
PyTorch provides a scalable and flexible Python API for LSTM computation.

We ran the LSTM benchmark using the script at https://github.com/xhzhao/pytorch-rnn-benchmark , in which:

The Python API torch.nn.LSTM is used as the entry point for the LSTM computation.
The benchmarks are run on 24 selected input shapes used by different NLP models.
The unit for the benchmarks is sentences per second (SPS). [N, T, D, Z] stands for batch size, sentence length, embedding size and hidden size. Specifically, [64, 50, 500, 500] is used by OpenNMT, and [64, 25, 4096, 4096] is used by DeepBench.
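
The SPS metric and the speedup column in the tables below can be sketched as follows. The formula for SPS is an assumption (batch size times the number of timed iterations, divided by elapsed time); the linked script may compute it differently. The speedup is simply the ratio of optimized to out-of-the-box (OOB) throughput.

```python
def sentences_per_second(batch_size, iterations, elapsed_seconds):
    """Assumed definition: sentences processed divided by wall-clock seconds."""
    return batch_size * iterations / elapsed_seconds


def speedup(optimized_sps, baseline_sps):
    """Ratio of optimized throughput to out-of-the-box throughput."""
    return optimized_sps / baseline_sps


# Hypothetical run: 100 iterations with batch size 64 taking 8 seconds.
print(sentences_per_second(64, 100, 8.0))

# First inference row for the Platinum 8180: OOB 899.4494 SPS, optimized 7393.76 SPS.
print(f"{speedup(7393.76, 899.4494):.2f}X")
```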

Platform: Platinum-8180
Phase: Inference
Unit: SPS (sentences per second)

LSTM Input Shape	Xeon Platinum 8180 OOB	Xeon Platinum 8180 Optimized	SpeedUp
[64, 15, 500, 500]	899.4494	7393.76	8.22X
[64, 20, 500, 500]	937.1688	5895.53	6.29X
[64, 25, 500, 500]	750.8159	4808.17	6.40X
[64, 30, 500, 500]	625.825	2351.56	3.76X
[64, 35, 500, 500]	536.1393	3446.69	6.43X
[64, 40, 500, 500]	469.1356	2907.74	6.20X
[64, 45, 500, 500]	417.338	2502.57	6.00X
[64, 50, 500, 500]	375.6814	2412.96	6.43X
[16, 25, 512, 512]	474.9601	1325.45	2.79X
[32, 25, 512, 512]	606.5853	2394.69	3.95X
[64, 25, 512, 512]	700.1314	3661.21	5.23X
[128, 25, 512, 512]	771.5298	4931.85	6.39X
[16, 25, 1024, 1024]	195.6518	434.34	2.22X
[32, 25, 1024, 1024]	261.1828	792.48	3.03X
[64, 25, 1024, 1024]	323.7316	1174.23	3.62X
[128, 25, 1024, 1024]	458.3642	1793.54	3.91X
[16, 25, 2048, 2048]	48.7229	71.07	1.46X
[32, 25, 2048, 2048]	77.4796	131.74	1.70X
[64, 25, 2048, 2048]	132.8328	245.78	1.85X
[128, 25, 2048, 2048]	178.2548	429.59	2.41X
[16, 25, 4096, 4096]	12.4995	16.99	1.36X
[32, 25, 4096, 4096]	23.0582	28.89	1.25X
[64, 25, 4096, 4096]	39.3725	53.48	1.36X
[128, 25, 4096, 4096]	61.866	97.97	1.58X

Platform: Platinum-8180
Phase: Training
Unit: SPS (sentences per second)

LSTM Input Shape	Xeon Platinum 8180 OOB	Xeon Platinum 8180 Optimized	Speed-up
[64, 15, 500, 500]	432.5038	740.19	1.71X
[64, 20, 500, 500]	385.2532	506.49	1.31X
[64, 25, 500, 500]	308.066	476.33	1.55X
[64, 30, 500, 500]	264.2467	406.49	1.54X
[64, 35, 500, 500]	217.2079	362.4	1.67X
[64, 40, 500, 500]	199.5474	321.25	1.61X
[64, 45, 500, 500]	187.0923	292.01	1.56X
[64, 50, 500, 500]	159.5678	255.32	1.60X
[16, 25, 512, 512]	168.2578	269.11	1.60X
[32, 25, 512, 512]	217.3134	365.27	1.68X
[64, 25, 512, 512]	273.1848	475.26	1.74X
[128, 25, 512, 512]	320.5748	549.36	1.71X
[16, 25, 1024, 1024]	62.4692	89.46	1.43X
[32, 25, 1024, 1024]	89.6243	144.03	1.61X
[64, 25, 1024, 1024]	127.414	199.49	1.57X
[128, 25, 1024, 1024]	174.6576	255.07	1.46X
[16, 25, 2048, 2048]	18.8309	25.69	1.36X
[32, 25, 2048, 2048]	30.9957	47.01	1.52X
[64, 25, 2048, 2048]	51.2821	75.98	1.48X
[128, 25, 2048, 2048]	71.7206	113.27	1.58X
[16, 25, 4096, 4096]	6.0788	7.46	1.23X
[32, 25, 4096, 4096]	10.954	13.98	1.28X
[64, 25, 4096, 4096]	18.5955	24.85	1.34X
[128, 25, 4096, 4096]	28.1366	39.01	1.39X

Platform: CPU E5-2699 v4
Phase: Inference
Unit: SPS (sentences per second)

LSTM Input Shape	Xeon E5-2699 OOB	Xeon E5-2699 Optimized	Speed-up
[64, 15, 500, 500]	1169.737	7149.82	6.11X
[64, 20, 500, 500]	923.5499	6033.54	6.53X
[64, 25, 500, 500]	739.8101	4846.39	6.55X
[64, 30, 500, 500]	618.0939	4027.08	6.52X
[64, 35, 500, 500]	528.3323	3401.53	6.44X
[64, 40, 500, 500]	462.2187	2972.32	6.43X
[64, 45, 500, 500]	410.5386	2625.95	6.40X
[64, 50, 500, 500]	369.9179	2372.84	6.41X
[16, 25, 512, 512]	639.4213	2172.63	3.40X
[32, 25, 512, 512]	680.3161	3561.47	5.24X
[64, 25, 512, 512]	727.8996	4864.45	6.68X
[128, 25, 512, 512]	760.9095	5754.56	7.56X
[16, 25, 1024, 1024]	320.0169	1381.03	4.32X
[32, 25, 1024, 1024]	349.7738	1916.54	5.48X
[64, 25, 1024, 1024]	368.3568	2265	6.15X
[128, 25, 1024, 1024]	490.1187	2518.24	5.14X
[16, 25, 2048, 2048]	137.989	383.87	2.78X
[32, 25, 2048, 2048]	159.1569	590.48	3.71X
[64, 25, 2048, 2048]	214.677	720.81	3.36X
[128, 25, 2048, 2048]	210.0029	683.88	3.26X
[16, 25, 4096, 4096]	42.7353	70.06	1.64X
[32, 25, 4096, 4096]	66.9777	126.43	1.89X
[64, 25, 4096, 4096]	82.5284	180.12	2.18X
[128, 25, 4096, 4096]	83.1054	180.03	2.17X

Platform: CPU E5-2699 v4
Phase: Training
Unit: SPS (sentences per second)

LSTM Input Shape	Xeon E5-2699 OOB	Xeon E5-2699 Optimized	Speed-up
[64, 15, 500, 500]	451.2899	627.66	1.39X
[64, 20, 500, 500]	370.242	497.26	1.34X
[64, 25, 500, 500]	298.1386	363.61	1.22X
[64, 30, 500, 500]	251.8914	327.72	1.30X
[64, 35, 500, 500]	225.749	285.99	1.27X
[64, 40, 500, 500]	192.7014	271.03	1.41X
[64, 45, 500, 500]	175.5287	245.5	1.40X
[64, 50, 500, 500]	161.343	229.74	1.42X
[16, 25, 512, 512]	207.6788	201.7	0.97X
[32, 25, 512, 512]	250.4016	301.76	1.21X
[64, 25, 512, 512]	306.2745	429.34	1.40X
[128, 25, 512, 512]	345.1608	456.06	1.32X
[16, 25, 1024, 1024]	66.2632	67.93	1.03X
[32, 25, 1024, 1024]	37.8289	114.71	3.03X
[64, 25, 1024, 1024]	76.6716	173.85	2.27X
[128, 25, 1024, 1024]	141.6185	218	1.54X
[16, 25, 2048, 2048]	20.5789	20.82	1.01X
[32, 25, 2048, 2048]	34.5047	36.93	1.07X
[64, 25, 2048, 2048]	55.1509	62.73	1.14X
[128, 25, 2048, 2048]	71.7717	88.76	1.24X
[16, 25, 4096, 4096]	6.8679	7.09	1.03X
[32, 25, 4096, 4096]	12.5718	13.85	1.10X
[64, 25, 4096, 4096]	20.1554	23.66	1.17X
[128, 25, 4096, 4096]	27.4074	34.49	1.26X

Conclusion:

According to the benchmarks collected on Intel Xeon platforms:

On Platinum 8180:

For LSTM inference (forward only), the speedup ranges from 1.25X to 8.22X.
For LSTM training (forward + backward), the speedup ranges from 1.23X to 1.74X.

On E5-2699 v4:

For LSTM inference (forward only), the speedup ranges from 1.64X to 7.56X.
For LSTM training (forward + backward), the speedup ranges from 0.97X to 3.03X.

Test results analysis:

For inference benchmarks: because the share of time spent in element-wise operations varies with the input shape, the speedup is expected to vary with the input shape as well.
For training benchmarks: the same reasoning applies. In addition, because the backward computation gains less from the element-wise optimization, the speedups for training are expected to be smaller than those for inference, and likewise to vary with the input shape.