CUDA-Warp RNN-Transducer

A GPU implementation of RNN Transducer (Graves 2012, 2013). This code is ported from the reference implementation (by Awni Hannun) and fully utilizes the CUDA warp mechanism.

The main bottleneck in the loss is the forward/backward pass, which is based on a dynamic programming algorithm. In particular, a nested loop populates a lattice of shape (T, U), and each value in this lattice depends on two previous cells, one from each dimension (e.g. in the forward pass).
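The nested loop described above can be sketched in plain NumPy. This is a minimal CPU illustration, not the code from this repository: the function name `rnnt_forward` is hypothetical, and it assumes that `log_probs` is already log-softmax normalized, that U counts lattice columns (so there are U-1 labels), and that index 0 is the blank symbol.

```python
import numpy as np

def rnnt_forward(log_probs, labels, blank=0):
    """Forward pass over the (T, U) lattice (illustrative sketch).

    log_probs: array of shape (T, U, V) with log-softmax outputs (assumed).
    labels:    array of shape (U - 1,) with target label ids (assumed).
    Returns the lattice of forward variables and the negative log-likelihood.
    """
    T, U, _ = log_probs.shape
    alphas = np.full((T, U), -np.inf)
    alphas[0, 0] = 0.0
    for t in range(T):
        for u in range(U):
            if t == 0 and u == 0:
                continue
            # Arrive from (t-1, u) by emitting blank.
            no_emit = alphas[t - 1, u] + log_probs[t - 1, u, blank] if t > 0 else -np.inf
            # Arrive from (t, u-1) by emitting the next label.
            emit = alphas[t, u - 1] + log_probs[t, u - 1, labels[u - 1]] if u > 0 else -np.inf
            alphas[t, u] = np.logaddexp(no_emit, emit)
    # A final blank from the last cell terminates the alignment.
    log_likelihood = alphas[T - 1, U - 1] + log_probs[T - 1, U - 1, blank]
    return alphas, -log_likelihood
```

Each cell reads only its left and lower neighbours, which is the dependency that makes the naive loop sequential and motivates the warp-based scheduling.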

CUDA executes threads in groups of 32 parallel threads called warps. Full efficiency is realized when all 32 threads of a warp agree on their execution path. This is exactly what is used to optimize the RNN Transducer. The lattice is split into warps along the T dimension. Within each warp, threads exchange values using fast warp-level operations. As soon as the current warp fills its last value, the next two warps, (t+32, u) and (t, u+1), start running. A schematic procedure for the forward pass is shown in the figure below, where T is the number of frames, U is the number of labels, and W is the warp size. A similar procedure for the backward pass runs in parallel.
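The dependency order between warps can be illustrated with a small Python sketch. It only models which warp-sized tiles could run concurrently under the rule stated above (a tile starts once its neighbours in T and U have finished); the function name `wavefront_schedule` is hypothetical, and the sketch does not reproduce how the CUDA kernels actually synchronize.

```python
def wavefront_schedule(T, U, W=32):
    """Group warp-sized tiles of the (T, U) lattice into concurrent steps.

    Tile (g, u) covers frames g*W .. g*W + W - 1 at label position u and
    depends on tiles (g - 1, u) and (g, u - 1), so every tile with the
    same g + u can run in the same step.
    Returns a list of steps; each step is a list of (t_start, u) pairs.
    """
    num_tiles = (T + W - 1) // W  # tiles along T, last one may be partial
    steps = [[] for _ in range(num_tiles + U - 1)]
    for g in range(num_tiles):
        for u in range(U):
            steps[g + u].append((g * W, u))
    return steps
```

For example, with T=150, U=40 and W=32 the lattice has 5 x 40 = 200 tiles, and this schedule visits them in 44 steps because independent tiles share a step.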

Performance

The NVIDIA profiler shows the advantage of the warp implementation over the non-warp implementation.

This warp implementation:

Non-warp implementation warp-transducer:

Unfortunately, in practice this advantage disappears because memory operations take much longer, especially if memory is synchronized on each iteration.

| | warp_rnnt (gather=False) | warp_rnnt (gather=True) | warprnnt_pytorch | transducer (CPU) |
|---|---|---|---|---|
| **T=150, U=40, V=28** | | | | |
| N=1 | 0.50 ms | 0.54 ms | 0.63 ms | 1.28 ms |
| N=16 | 1.79 ms | 1.72 ms | 1.85 ms | 6.15 ms |
| N=32 | 3.09 ms | 2.94 ms | 2.97 ms | 12.72 ms |
| N=64 | 5.83 ms | 5.54 ms | 5.23 ms | 23.73 ms |
| N=128 | 11.30 ms | 10.74 ms | 9.99 ms | 47.93 ms |
| **T=150, U=20, V=5000** | | | | |
| N=1 | 0.95 ms | 0.80 ms | 1.74 ms | 21.18 ms |
| N=16 | 8.74 ms | 6.24 ms | 16.20 ms | 240.11 ms |
| N=32 | 17.26 ms | 12.35 ms | 31.64 ms | 490.66 ms |
| N=64 | out-of-memory | out-of-memory | out-of-memory | 944.73 ms |
| N=128 | out-of-memory | out-of-memory | out-of-memory | 1894.93 ms |
| **T=1500, U=300, V=50** | | | | |
| N=1 | 5.89 ms | 4.99 ms | 10.02 ms | 121.82 ms |
| N=16 | 95.46 ms | 78.88 ms | 76.66 ms | 732.50 ms |
| N=32 | out-of-memory | 157.86 ms | 165.38 ms | 1448.54 ms |
| N=64 | out-of-memory | out-of-memory | out-of-memory | 2767.59 ms |

Benchmarked on a GeForce RTX 2070 Super GPU, Intel i7-10875H CPU @ 2.30GHz.
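The out-of-memory entries in the table are easier to interpret with a rough size estimate of the input tensor. The sketch below assumes a dense float32 tensor of shape (N, T, U, V) and ignores gradients, lattices and any other buffers; the helper name `log_probs_bytes` is hypothetical, and whether the benchmark's U matches the tensor's third dimension exactly is also an assumption.

```python
def log_probs_bytes(N, T, U, V, bytes_per_value=4):
    """Size in bytes of a dense (N, T, U, V) tensor (float32 by default)."""
    return N * T * U * V * bytes_per_value

# Configurations from the second benchmark block (T=150, U=20, V=5000).
for N in (32, 64):
    gib = log_probs_bytes(N, 150, 20, 5000) / 2**30
    print(f"N={N}: {gib:.2f} GiB for the log-probability tensor alone")
```

At N=64 this tensor alone is about 3.8 GB. If a gradient buffer of the same shape is also allocated (an assumption), the total approaches the 8 GB of memory on an RTX 2070 Super, which would be consistent with the failures reported for that row.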

Note

Install

There are two bindings for the core algorithm:

Reference