The main bottleneck is the data transfer speed between the GPU's global memory and the SMs, i.e. memory bandwidth.
Also, using Tensor Cores doesn't necessarily imply using half precision, as NVIDIA now supports single-precision (TF32) operations on Tensor Cores too.
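As a concrete example of the single-precision path: on Ampere-or-newer GPUs, PyTorch can route float32 matmuls and convolutions to Tensor Cores via TF32. This is just a configuration sketch (the flags below are real PyTorch settings, but whether they help depends on your GPU and workload):

```python
import torch

# Allow TF32 on Tensor Cores for float32 matmuls and cuDNN convolutions.
# Effective on Ampere-or-newer GPUs; a no-op on older hardware.
torch.backends.cuda.matmul.allow_tf32 = True
torch.backends.cudnn.allow_tf32 = True
```

Note that TF32 truncates the mantissa relative to full FP32, so results can differ slightly; PyTorch exposes these flags precisely so you can opt in or out.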
I agree with you; it is a combination of both. I would also add a few points: PyTorch and TensorFlow essentially use the same CUDA/cuDNN libraries, but the frameworks are developed to cater to a wide range of corner cases. That tends to make them robust, but sometimes slower, because of their memory access patterns, algorithm-selection heuristics, and some extra operations.
Also, we can get rid of many memory read/write operations by fusing kernels.
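To make the fusion point concrete, here is a minimal NumPy sketch. The unfused version mimics three separate kernels, each materializing an intermediate array in memory; the fused version applies the bias add and ReLU as an epilogue of the matmul in one expression. (NumPy still allocates temporaries internally, so this only illustrates the idea; real fusion happens in CUDA kernels or via compilers like torch.compile/XLA. All names here are illustrative.)

```python
import numpy as np

def forward_unfused(x, w, b):
    # Three separate "kernels", each reading and writing a full array:
    t1 = x @ w                   # kernel 1: write t1 to memory
    t2 = t1 + b                  # kernel 2: read t1, write t2
    return np.maximum(t2, 0.0)   # kernel 3: read t2, write output

def forward_fused(x, w, b):
    # One logical pass: bias add and ReLU applied as a matmul epilogue,
    # so intermediates would stay in registers/cache on a real GPU.
    return np.maximum(x @ w + b, 0.0)

x = np.random.randn(4, 8)
w = np.random.randn(8, 3)
b = np.random.randn(3)
assert np.allclose(forward_unfused(x, w, b), forward_fused(x, w, b))
```

The outputs are identical; the win from fusion is purely in memory traffic, which is why it matters on bandwidth-bound GPUs.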
A bottleneck can also be running parts of the model on the GPU where the CPU is more efficient. But most bottlenecks are memory issues: GPUs do not necessarily have enough memory, so you end up having to access "external memory", which slows down the forward pass a lot.
The main bottleneck is the time spent adding the bias after a Conv/Dense layer. Well-optimized code can remove that bottleneck; I haven't done it in my code as it makes it less readable.
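One classic way to remove the separate bias-add pass for a dense layer is to fold the bias into the weight matrix: augment the input with a column of ones so that y = xW + b becomes a single matmul. This is a hypothetical NumPy illustration of the trick; production kernels instead fuse the bias add into the GEMM epilogue on the GPU:

```python
import numpy as np

def dense_with_bias(x, W, b):
    # Two passes: matmul kernel, then a separate bias-add kernel.
    return x @ W + b

def dense_folded(x, W, b):
    # Fold the bias into the weights: append b as an extra row of W
    # and a column of ones to x, so one matmul does everything.
    W_aug = np.vstack([W, b])                          # shape (in+1, out)
    x_aug = np.hstack([x, np.ones((x.shape[0], 1))])   # shape (batch, in+1)
    return x_aug @ W_aug

x = np.random.randn(5, 8)
W = np.random.randn(8, 3)
b = np.random.randn(3)
assert np.allclose(dense_with_bias(x, W, b), dense_folded(x, W, b))
```

The trade-off the answer alludes to is visible here: the folded version is mathematically identical but noticeably harder to read than the plain `x @ W + b`.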