I am using Python's TensorFlow API from C# through my own binding, and I don't understand how you got 250ms latency on C API without screwing up on your side. I could effortlessly run a super real-time network playing a video game with soft actor-critic with my setup on 1.15.
We didn't understand either. One thing, our cloud instances had no GPU's. Our model wasn't exactly complex, just maybe 15 float and string scalar inputs.