The main bottleneck is the data transfer speed between the GPU's global memory and the SMs, i.e. memory bandwidth.
Also, using Tensor Cores doesn't necessarily imply using half precision, as NVIDIA now supports single-precision (TF32) operations on Tensor Cores too.
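As a concrete example of the single-precision path: on Ampere-or-newer GPUs, PyTorch can route float32 matmuls and convolutions to Tensor Cores via TF32. This is just a configuration sketch (the flags below are real PyTorch settings, but whether they help depends on your GPU and workload):

```python
import torch

# Allow TF32 on Tensor Cores for float32 matmuls and cuDNN convolutions.
# Effective on Ampere-or-newer GPUs; a no-op on older hardware.
torch.backends.cuda.matmul.allow_tf32 = True
torch.backends.cudnn.allow_tf32 = True
```

Note that TF32 truncates the mantissa relative to full FP32, so results can differ slightly; PyTorch exposes these flags precisely so you can opt in or out.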
I agree with you; it is a combination of both. I would also add a few points: PyTorch and TensorFlow essentially use the same CUDA/cuDNN libraries, but the frameworks are developed to cater to a wide range of corner cases. That tends to make them robust, but sometimes slower, because of their memory access patterns, algorithm-selection heuristics, and some extra operations.
Also, we can get rid of many memory read/write operations by fusing kernels.
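To make the fusion point concrete, here is a minimal NumPy sketch. The unfused version mimics three separate kernels, each materializing an intermediate array in memory; the fused version applies the bias add and ReLU as an epilogue of the matmul in one expression. (NumPy still allocates temporaries internally, so this only illustrates the idea; real fusion happens in CUDA kernels or via compilers like torch.compile/XLA. All names here are illustrative.)

```python
import numpy as np

def forward_unfused(x, w, b):
    # Three separate "kernels", each reading and writing a full array:
    t1 = x @ w                   # kernel 1: write t1 to memory
    t2 = t1 + b                  # kernel 2: read t1, write t2
    return np.maximum(t2, 0.0)   # kernel 3: read t2, write output

def forward_fused(x, w, b):
    # One logical pass: bias add and ReLU applied as a matmul epilogue,
    # so intermediates would stay in registers/cache on a real GPU.
    return np.maximum(x @ w + b, 0.0)

x = np.random.randn(4, 8)
w = np.random.randn(8, 3)
b = np.random.randn(3)
assert np.allclose(forward_unfused(x, w, b), forward_fused(x, w, b))
```

The outputs are identical; the win from fusion is purely in memory traffic, which is why it matters on bandwidth-bound GPUs.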
A bottleneck can also be running parts of the model on the GPU where the CPU is more efficient. But most bottlenecks are memory issues: GPUs do not necessarily have enough memory, so you end up having to access "external memory", which slows down the forward pass a lot.
The main bottleneck is the time spent adding the bias after a Conv/Dense layer. Well-optimized code can remove that bottleneck; I haven't done it in my code as it makes it less readable.
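One classic way to remove the separate bias-add pass for a dense layer is to fold the bias into the weight matrix: augment the input with a column of ones so that y = xW + b becomes a single matmul. This is a hypothetical NumPy illustration of the trick; production kernels instead fuse the bias add into the GEMM epilogue on the GPU:

```python
import numpy as np

def dense_with_bias(x, W, b):
    # Two passes: matmul kernel, then a separate bias-add kernel.
    return x @ W + b

def dense_folded(x, W, b):
    # Fold the bias into the weights: append b as an extra row of W
    # and a column of ones to x, so one matmul does everything.
    W_aug = np.vstack([W, b])                          # shape (in+1, out)
    x_aug = np.hstack([x, np.ones((x.shape[0], 1))])   # shape (batch, in+1)
    return x_aug @ W_aug

x = np.random.randn(5, 8)
W = np.random.randn(8, 3)
b = np.random.randn(3)
assert np.allclose(dense_with_bias(x, W, b), dense_folded(x, W, b))
```

The trade-off the answer alludes to is visible here: the folded version is mathematically identical but noticeably harder to read than the plain `x @ W + b`.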