Waveglow Inference in CUDA C++ (github.com/saurabh-29)
72 points by Saurabh_29 on Sept 8, 2019 | 10 comments



The main bottleneck is the data transfer speed between GPU memory and the SMs. Also, using Tensor Cores doesn't necessarily mean using half precision, as NVIDIA now supports single-precision accumulation in Tensor Cores too.
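For illustration, here's a minimal WMMA sketch (mine, not code from this repo) that takes FP16 input tiles but accumulates in FP32:

    #include <cuda_fp16.h>
    #include <mma.h>
    using namespace nvcuda;

    // One warp computes a 16x16 output tile on Tensor Cores:
    // FP16 inputs, FP32 accumulation. Launch with one warp (32 threads);
    // matrices are assumed padded to 16x16.
    __global__ void wmma_tile(const half *a, const half *b, float *c) {
        wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> af;
        wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> bf;
        wmma::fragment<wmma::accumulator, 16, 16, 16, float> cf;

        wmma::fill_fragment(cf, 0.0f);
        wmma::load_matrix_sync(af, a, 16);   // leading dimension 16
        wmma::load_matrix_sync(bf, b, 16);
        wmma::mma_sync(cf, af, bf, cf);      // D = A*B + C on Tensor Cores
        wmma::store_matrix_sync(c, cf, 16, wmma::mem_row_major);
    }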


I'm interested in this. Does anyone know, in general, where the bottlenecks are on a forward pass in PyTorch?


In terms of writing the code in Python vs. C++ (or another compiled language), there is a good post about this here: https://stackoverflow.com/questions/7931314/pycuda-vs-c-perf...


The biggest thing is using lower precision:

> this implementation gives 25% speedup over Nvidia's Pytorch implementation in full precision and 2.5-3x speedup when using TensorCore

Tensor Cores are lower-precision units.

Other than that, the speedups presumably come from better-written CUDA.

If you're asking what the bottlenecks are in general for these kinds of kernels, they're pretty much always memory-bound.

See pages 3-5 for the kinds of optimizations that need to be done: https://people.csail.mit.edu/jrk/jrkthesis.pdf
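Since these kernels are bandwidth-bound, halving the element size roughly doubles effective throughput. A minimal sketch (my own, assuming the element count is even so it packs into half2):

    #include <cuda_fp16.h>

    // Elementwise add in FP16, two values per thread via half2.
    // Half the bytes moved vs. FP32, so ~2x throughput when memory-bound.
    // Requires compute capability 5.3+ for __hadd2.
    __global__ void add_half2(const __half2 *a, const __half2 *b,
                              __half2 *out, int n2) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n2) out[i] = __hadd2(a[i], b[i]);  // two FP16 adds at once
    }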


I agree with you; it is a combination of both. To add a few points: PyTorch/TensorFlow essentially use the same CUDA/cuDNN libraries underneath, but those libraries are developed to cater to a wide range of corner cases, which makes them robust but sometimes slower because of their memory access patterns, algorithm selection heuristics, and some extra operations. Also, we can get rid of many memory read/write operations by fusing kernels.
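For example (a sketch of mine, not code from the repo): a bias add followed by a tanh is two launches and two round trips through DRAM unfused, but one read and one write fused:

    #include <math.h>

    // Unfused: two launches; the tensor is read and written twice.
    // Bias indexing i % c assumes a channels-last layout.
    __global__ void add_bias(float *x, const float *b, int n, int c) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) x[i] += b[i % c];
    }
    __global__ void apply_tanh(float *x, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) x[i] = tanhf(x[i]);
    }

    // Fused: one launch; one read and one write per element.
    __global__ void bias_tanh(float *x, const float *b, int n, int c) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) x[i] = tanhf(x[i] + b[i % c]);
    }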


A bottleneck can be running parts of the model on the GPU where the CPU is more efficient. But most of the bottlenecks are memory issues: GPUs do not necessarily have enough on-device memory, so you end up having to access "external memory" (host RAM over PCIe), which slows down the forward pass a ton.


Also, in some cases, like small RNNs/LSTMs, CPUs can be faster.


The main bottleneck is the time spent adding the bias after conv/dense layers. Well-optimized code can remove that bottleneck; I haven't done it in my code, as it makes the code less readable.
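The idea would be to fold the bias into the GEMM/conv epilogue so the output is written once instead of being re-read by a separate bias kernel. A rough sketch on a naive matrix-vector product (illustrative only, not the kernels in this repo):

    // y = W*x + b in one kernel: each thread starts its accumulator
    // from the bias, so there is no second pass over the output.
    __global__ void gemm_bias(const float *W, const float *x, const float *b,
                              float *y, int M, int K) {
        int row = blockIdx.x * blockDim.x + threadIdx.x;
        if (row >= M) return;
        float acc = b[row];                 // start from the bias
        for (int k = 0; k < K; ++k)
            acc += W[row * K + k] * x[k];   // dot product of row with input
        y[row] = acc;                       // single write, no bias pass
    }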


I would say passing data back and forth between the host and the GPU.
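One standard mitigation is pinned host memory plus async copies on a stream. A minimal sketch (kernel and sizes are hypothetical); a single stream is shown, so to actually overlap copy and compute you'd chunk the buffer across several streams:

    #include <cuda_runtime.h>

    __global__ void scale(float *x, int n) {        // stand-in for real work
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) x[i] *= 2.0f;
    }

    int main() {
        const int n = 1 << 20;
        const size_t bytes = n * sizeof(float);
        float *h, *d;
        cudaMallocHost(&h, bytes);                  // pinned: enables async DMA
        cudaMalloc(&d, bytes);

        cudaStream_t s;
        cudaStreamCreate(&s);
        cudaMemcpyAsync(d, h, bytes, cudaMemcpyHostToDevice, s);
        scale<<<(n + 255) / 256, 256, 0, s>>>(d, n);
        cudaMemcpyAsync(h, d, bytes, cudaMemcpyDeviceToHost, s);
        cudaStreamSynchronize(s);

        cudaFree(d); cudaFreeHost(h); cudaStreamDestroy(s);
        return 0;
    }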


+1, it is the biggest bottleneck at times



