Assuming it's compute-bound, that's a factor of 5400 (~13 doublings in CPU performance required to get to real-time, assuming no algorithmic improvements).
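Quick back-of-the-envelope check of that doubling count, using the 5400x slowdown figure quoted above (not measured here):

    import math

    slowdown = 5400                    # 5400x slower than real-time (figure from the parent comment)
    doublings = math.log2(slowdown)    # number of successive 2x speedups needed
    print(f"log2({slowdown}) = {doublings:.1f}  -> roughly 13 doublings")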
If I'm not mistaken, the current limitation is that samples have to be produced sequentially within a dependent stretch of audio; independent sentences could perhaps be generated simultaneously using copies of the net, assuming no memory limitations. I wonder if it's already possible to create an audiobook, for instance, in reasonable time.
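A minimal sketch of that batching idea, assuming a per-step model that accepts a batch of independent sequences. The predict_next_sample function here is a hypothetical stand-in for one WaveNet forward pass per output sample, not the real model:

    import numpy as np

    def predict_next_sample(batch_history):
        # Hypothetical stand-in for one WaveNet forward pass per sequence.
        # A real WaveNet would condition on the receptive field of past samples.
        return np.tanh(batch_history[:, -1] * 0.9 + 0.01)

    n_sentences = 8       # independent sentences generated in parallel
    n_samples = 16000     # one second of 16 kHz audio
    audio = np.zeros((n_sentences, n_samples + 1))

    # Samples within one sentence are strictly sequential, but the batch
    # dimension lets independent sentences share each forward pass.
    for t in range(n_samples):
        audio[:, t + 1] = predict_next_sample(audio[:, : t + 1])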
Google never stated they use those to train models, as far as I know. They seem to be used primarily to save energy when deploying trained models at scale.
There's no reason they couldn't use them for training, as long as they can account for the lower-precision operations. I'd think it would be much cheaper to train on them, at that scale anyway.
Afaik the Google TPU does inference only, at 8 bits. I don't think it's possible to train a neural network at 8-bit precision at this point in time. FP16 works for training though, and is twice as fast as FP32 on certain Nvidia chips.
Backpropagation can work at any precision, as long as you use stochastic rounding (so that the rounding errors are not correlated). Without stochastic rounding, even 16 bits will have rounding-error bias.
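A minimal sketch of stochastic rounding on a fixed-point grid, just to illustrate why it removes the bias (plain numpy, not tied to any particular training framework; the grid spacing is made up):

    import numpy as np

    def stochastic_round(x, step):
        # Round to a multiple of `step`, rounding up with probability
        # proportional to the remainder, so the result is unbiased in expectation.
        scaled = x / step
        floor = np.floor(scaled)
        return (floor + (np.random.rand(*x.shape) < (scaled - floor))) * step

    grads = np.random.randn(100_000) * 1e-4 + 5e-4   # toy "gradients", all smaller than one grid step
    step = 1.0 / 256                                 # coarse 8-bit-style grid

    nearest = np.round(grads / step) * step
    stochastic = stochastic_round(grads, step)

    # Round-to-nearest collapses these sub-step gradients to zero (a systematic bias),
    # while stochastic rounding preserves their mean in expectation, which is what
    # accumulation over many small updates relies on.
    print(grads.mean(), nearest.mean(), stochastic.mean())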
I haven't seen 8-bit training implemented in any (public) framework yet - that's not to say it isn't possible. If it works, that's great, especially for specialised hardware.
That doesn't imply they can run WaveNet yet - for inference this net is pretty much worst-case serial. Their TPU ASIC is almost certainly highly parallel, like a GPU - it actually has to be that way for energy efficiency (which is its claimed benefit).
WaveNet actually looks like it could have been designed to run on CPUs in production, at least once they optimize it further. Sampling is super slow right now because it requires an enormous number of tiny dependent TF ops, and thus kernel launches whose overhead dwarfs the tiny amount of work each one does. A custom implementation could probably circumvent that by evaluating all the layers sequentially in local cache on a fast CPU.
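A rough sketch of what that fused sampling loop looks like in shape: one small matrix-vector product per layer per sample, all staying hot in cache, instead of one graph-op dispatch per product. The layer count, channel width, and the toy recurrence are made up for illustration, not taken from the paper:

    import numpy as np

    layers, channels = 30, 64    # made-up sizes, for illustration only
    W = [np.random.randn(channels, channels) * 0.01 for _ in range(layers)]

    def generate(n_samples):
        # In the graph-based TF version, each of these tiny matvecs is a separate
        # kernel dispatch, and that per-op overhead dominates the actual math.
        # A tight loop like this keeps the weights and activations in local cache.
        x = np.random.randn(channels) * 0.1
        out = np.empty(n_samples)
        for t in range(n_samples):
            h = x
            for w in W:
                h = np.tanh(w @ h)   # stand-in for one dilated-conv step
            out[t] = h[0]            # emit one audio sample
            x = h                    # feed state back in (toy recurrence, not real WaveNet caching)
        return out

    audio = generate(16000)          # one second at 16 kHz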
Or they just designed it without much concern for production plausibility yet.