
In what circumstance would one measure an end-to-end time budget in training? What would that metric tell you? You don't care about latency, you care about throughput, which can be scaled almost completely independently of the "wrapper language", for lack of a better term; in this case that's Python.
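To make that concrete, here's a rough sketch of the metric that actually matters during training (train_step and batches are placeholder names, not anything from a real framework):

    import time

    def measure_throughput(train_step, batches, batch_size):
        # Throughput = examples processed per second over many steps.
        # Per-step latency doesn't matter as long as the device stays busy.
        start = time.perf_counter()
        steps = 0
        for batch in batches:
            train_step(batch)
            steps += 1
        elapsed = time.perf_counter() - start
        return steps * batch_size / elapsed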

It seems some commenters on this thread have not really thought through the lifecycle of a learned model and the tradeoffs existing frameworks exploit to make things fast _and_ easy to use. In training we care about throughput. That's great, because we can use a high-level DSL to construct a graph that trains in a highly concurrent execution mode or on dedicated hardware. Using the high-level DSL is what allows us to abstract away those details and still get good training throughput. Tradeoffs still bleed out of the abstraction (batch size, network size, architecture, etc. all affect how efficiently a given piece of hardware runs), but that is inevitable when you're moving from CPU to GPU to ASIC.
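As a minimal sketch of what "high-level DSL" means in practice (assuming Keras here; any of the big frameworks looks similar): you describe the graph, and the framework decides how to lower it onto whatever hardware is available.

    import numpy as np
    import tensorflow as tf

    # The model definition is the "DSL": it says what to compute,
    # not how to schedule it on CPU/GPU/TPU.
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(256, activation="relu", input_shape=(784,)),
        tf.keras.layers.Dense(10),
    ])
    model.compile(
        optimizer="adam",
        loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    )

    # Dummy data just to make the sketch runnable.
    x_train = np.random.rand(1024, 784).astype("float32")
    y_train = np.random.randint(0, 10, size=(1024,))

    # Batch size is one of the tradeoffs that bleeds through the
    # abstraction: it largely determines accelerator utilization.
    model.fit(x_train, y_train, batch_size=256, epochs=1)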

When you are done training and you want to use the model in a low-latency environment, you use a C/C++ binary to serve it. Latency matters there, so exploit the fact that you're no longer defining a model (no need for a fancy DSL) and just serve it from a very simple but highly optimized API.
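One concrete instance of that pattern (assuming the TensorFlow stack; other frameworks have their own equivalents): export the trained graph from Python into a language-neutral artifact, then let a C++ server own the latency-critical path.

    import tensorflow as tf

    # Stand-in for a model you've already trained.
    model = tf.keras.Sequential([tf.keras.layers.Dense(10, input_shape=(784,))])

    # Freeze the graph into a SavedModel; the Python/DSL layer ends here.
    tf.saved_model.save(model, "/tmp/my_model/1")

    # A C++ binary (e.g. TensorFlow Serving) loads and serves it, with no
    # Python anywhere in the request path:
    #   tensorflow_model_server --model_name=my_model \
    #       --model_base_path=/tmp/my_model --rest_api_port=8501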



