
I thought all frameworks could do this type of profiling (e.g. torch.backends.cudnn.benchmark = True).
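For reference, enabling that autotuning in PyTorch is a one-liner. A minimal sketch (the model and input shape below are placeholders), which only pays off when input sizes stay fixed so the benchmarked algorithm choice can be reused:

    import torch
    import torchvision

    # Let cuDNN try the available convolution algorithms for the shapes it
    # actually sees, then reuse the fastest one on subsequent calls.
    torch.backends.cudnn.benchmark = True

    model = torchvision.models.resnet18().eval().cuda()   # placeholder model
    x = torch.randn(1, 3, 224, 224, device="cuda")        # fixed input shape
    with torch.no_grad():
        y = model(x)   # the first call pays the autotuning cost; later calls don't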

Nvidia might have eliminated any data pipeline bottlenecks (with careful DALI tuning), but I'd still expect a much smaller speedup. Maybe they compiled PyTorch with specific optimizations and used newer CUDA/cuDNN code; I don't know.




TRT profiling is more extensive. On the model I'm currently working with, an object detector that runs on a Jetson Xavier, the initial TRT profiling takes something like 4 minutes. You can save the result, but the saved engine is hardware-specific: it's only optimal for the device it was profiled on. I cache it on disk anyway - I can't wait 4 minutes every time I run a test.
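For what it's worth, the on-disk cache is just the serialized engine, deserialized on the next run. A rough sketch with the TensorRT Python API - the build step is hidden behind a hypothetical build_engine_from_onnx() helper, since the exact builder calls differ between TRT versions:

    import os
    import tensorrt as trt

    TRT_LOGGER = trt.Logger(trt.Logger.WARNING)
    ENGINE_PATH = "detector.engine"   # hypothetical cache location

    def load_or_build_engine(onnx_path):
        if os.path.exists(ENGINE_PATH):
            # Cached engine: skips the ~4 minute profiling step entirely.
            with open(ENGINE_PATH, "rb") as f:
                return trt.Runtime(TRT_LOGGER).deserialize_cuda_engine(f.read())
        engine = build_engine_from_onnx(onnx_path)   # hypothetical: parse ONNX, profile, build
        with open(ENGINE_PATH, "wb") as f:
            f.write(engine.serialize())              # engine is tied to this GPU + TRT version
        return engine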

PyTorch, as far as I can tell, does much lighter cuDNN profiling. It's arguably the better trade-off of tuning time against speedup, but the benefit is nowhere near as significant.

Another framework that does amazing optimization (but on the CPU) is OpenVINO. Normally I don't expect much on the software side from Intel, but this thing really blows the doors off everything else when you don't have a GPU at your disposal, provided you have an Intel processor. The way they do it is to generate kernels that fit your data, but not the way XLA does it: they hand-code them in a DSL that emits assembly via Xbyak, baking their deep knowledge of Intel hardware into the generator. When it's time to run the model, that DSL spits out kernels tailored to that particular model. It's pretty neat work, IMO.
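If anyone wants to kick the tires, the runtime side is small. A sketch assuming the openvino.runtime Python API and an IR model converted offline (older releases exposed this as openvino.inference_engine.IECore instead; "model.xml" and the input shape are placeholders):

    import numpy as np
    from openvino.runtime import Core

    core = Core()
    model = core.read_model("model.xml")           # IR produced by Intel's offline converter
    compiled = core.compile_model(model, "CPU")    # kernel generation/selection happens here

    x = np.random.rand(1, 3, 224, 224).astype(np.float32)
    result = compiled([x])                         # dict keyed by the model's output ports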


I see, thanks. In my day job I develop hardware-accurate simulations of a deep learning accelerator. This involves taking a SPICE model, simplifying it into a set of abstractions in NumPy, then accelerating that NumPy code on GPUs. Currently I'm porting a ResNet-50 model from NumPy to PyTorch, and the next step is to speed up the PyTorch code (right now I get ~1 image per second, which is about 10x better than NumPy). Perhaps I should look into porting the model from PyTorch to TensorRT.
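Before porting anything, it might be worth pinning down that images-per-second number with a synchronized timer, since CUDA launches are asynchronous and naive timing can mislead. A rough sketch using a stock torchvision ResNet-50 as a stand-in for your model:

    import time
    import torch
    import torchvision

    model = torchvision.models.resnet50().eval().cuda()   # stand-in for the ported model
    x = torch.randn(1, 3, 224, 224, device="cuda")

    with torch.no_grad():
        for _ in range(10):                  # warm-up: cuDNN autotuning, allocator, etc.
            model(x)
        torch.cuda.synchronize()
        start = time.time()
        for _ in range(100):
            model(x)
        torch.cuda.synchronize()             # drain queued kernels before stopping the clock
    print("images/sec:", 100 / (time.time() - start))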


If you're working with PyTorch, porting basically means exporting to ONNX. Sometimes you'll run into an op that doesn't work with ONNX, but there are a lot fewer of those in TRT7. Unfortunately I have to work with TRT6, so I'm stuck on PyTorch 1.2 and have to get "creative" to work around TRT6 bugs. That said, it could very well be painless for you. No reason not to try: just export the model and benchmark it with `trtexec`, in both fp32 and fp16. An hour of work at most.
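Concretely, something along these lines - the model, shape, and opset below are placeholders, so adjust them for your network:

    import torch
    import torchvision

    model = torchvision.models.resnet50().eval()   # stand-in for your model
    dummy = torch.randn(1, 3, 224, 224)            # must match your real input shape

    torch.onnx.export(
        model, dummy, "resnet50.onnx",
        opset_version=11,                          # TRT7 handles opset 11; TRT6 is pickier
        input_names=["input"], output_names=["output"],
    )

Then `trtexec --onnx=resnet50.onnx` and `trtexec --onnx=resnet50.onnx --fp16` will build an engine and report latency/throughput for each precision.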



