TPUs aren't necessarily a pro. They go back 15 years and don't seem to have yielded any kind of durable advantage. Developing them is expensive, and their architecture was often over-fit to yesterday's algorithms, which is why they've been through so many redesigns. Their competitors have routinely moved much faster using CUDA.
Once the space settles down, the balance might tip towards specialized accelerators, but NVIDIA has plenty of room to make specialized silicon and cut prices too. Google has yet to prove that the TPU investment is worth it.
Not sure how familiar you are with the internal situation... but from my experience I think it's safe to say that the TPU basically multiplies Google's computation capability by 10x, if not 20x. They also don't need to compete with others to secure expensive Nvidia chips. If that's not an advantage, I don't know what would count as one. The entire point of vertical integration is to secure full control of your stack so your capability isn't limited by potential competitors, and the TPU is one of the key components of that strategy.
Also worth noting that the Ads division is the largest, heaviest user of TPUs. Thanks to them, Ads can afford to run a bunch of different expensive models that you could not realistically serve on GPUs. The revenue delta from this is more than enough to pay off the entire investment history of the TPU.
They must very much compete with others. All these chips are fabbed at the same facilities in Taiwan, and one company's capacity trades off against another's. Google has to compete for that fab capacity alongside everyone else, as well as for skilled chip designers, etc.
> The revenue delta from this is more than enough to pay off the entire investment history of the TPU.
Possibly; such statements were common when I was there too, but digging in would often reveal that the numbers being used for what things cost, or how revenue was being allocated, were kind of ad hoc and semi-fictional. It doesn't matter as long as the company itself makes money, but I heard a lot of very odd accounting when I was there. I doubt that's changed in the years since.
Regardless, the question is not whether some ads launches can pay for the TPUs; it's whether it'd have worked out cheaper in the end to just buy lots of GPUs. Answering that would require a lot of data that's certainly considered very sensitive, and it makes some assumptions about whether Google could have negotiated private deals, etc.
> They must very much compete with others. All these chips are fabbed at the same facilities in Taiwan, and one company's capacity trades off against another's.
I'm not sure what point you're trying to make here. By that logic, even if you own a fab you still have to compete for rare metals, ASML machines, etc. That's logic for its own sake. In the real world it is much easier to compete once you're outside Nvidia's allocation, because you've removed the critical bottleneck. And Nvidia has every incentive to control supply to maximize its own profit, not to meet demand.
> Possibly; such statements were common when I was there too, but digging in would often reveal that the numbers being used for what things cost, or how revenue was being allocated, were kind of ad hoc and semi-fictional.
> Regardless, the question is not whether some ads launches can pay for the TPUs; it's whether it'd have worked out cheaper in the end to just buy lots of GPUs.
Of course everyone can build a narrative in favor of their launch, but I've been involved in some of those ads quality launches and can say pretty confidently that most of them would not have been launchable without TPUs at all. This was especially true in the early days of the TPU, when the supply of datacenter GPUs was extremely limited and immature.
Can more GPUs solve it? Companies talk about 100k~200k H100s as a massive cluster, and Google already has much larger TPU deployments with compute capacity in a different order of magnitude. The problem is that you cannot simply buy more compute even if you have lots of money. I've been pretty clear that relying on Nvidia's supply could be a critical limiting factor from a strategic point of view, but you're trying to move the goalposts. Please don't.
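To put rough numbers on the comparison, here's a back-of-envelope sketch using public datasheet peak BF16 figures, which are only approximate; real utilization is far below peak for both stacks, and the pod count is the part nobody outside knows:

    # Rough peak-FLOPs comparison; per-chip numbers are public
    # datasheet figures (dense BF16), treat as approximate.

    H100_TFLOPS = 989            # H100 SXM, dense BF16, per chip
    V5P_TFLOPS = 459             # TPU v5p, dense BF16, per chip
    V5P_POD_CHIPS = 8960         # chips in one full v5p pod

    def peak_pflops(chips: int, tflops: float) -> float:
        """Aggregate peak throughput in PFLOP/s."""
        return chips * tflops / 1000

    print(f"100k H100s:  {peak_pflops(100_000, H100_TFLOPS):,.0f} PFLOP/s")
    print(f"one v5p pod: {peak_pflops(V5P_POD_CHIPS, V5P_TFLOPS):,.0f} PFLOP/s")
    # ~24 v5p pods match 100k H100s on paper; multiply by however
    # many pods you think Google actually runs across its fleet.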
So are the electricity and cooling costs at Google's scale. Improving perf-per-watt efficiency can pay for itself. The fact that they keep iterating on it suggests it's not a negative-return exercise.
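As a toy illustration of the scale involved (every number here is invented for the example):

    # Why perf-per-watt matters at fleet scale. All figures are
    # hypothetical, chosen only to show the shape of the arithmetic.

    FLEET_CHIPS = 500_000        # hypothetical accelerator fleet size
    WATTS_SAVED_PER_CHIP = 50    # hypothetical efficiency gain per chip
    PUE = 1.1                    # datacenter overhead (cooling etc.)
    PRICE_PER_KWH = 0.05         # hypothetical industrial power price, USD
    HOURS_PER_YEAR = 24 * 365

    kwh_saved = FLEET_CHIPS * WATTS_SAVED_PER_CHIP * PUE * HOURS_PER_YEAR / 1000
    print(f"annual power savings: ${kwh_saved * PRICE_PER_KWH:,.0f}")
    # -> roughly $12M/year from a 50 W/chip improvement alone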
TPUs probably can pay for themselves, especially given NVIDIA's huge margins. But that's not a given just because Google funds them. When I worked there Google routinely funded all kinds of things without even the foggiest idea of whether it was profitable or not. There was just a really strong philosophical commitment to doing everything in house, no matter what.
> When I worked there Google routinely funded all kinds of things without even the foggiest idea of whether it was profitable or not.
You're talking about small-money bets. The technical infrastructure group at Google makes a lot of them, to explore options or hedge risks, but they only scale the things that make financial sense. They aren't dumb people after all.
The TPU was a small-money bet for quite a few years until this latest AI boom.
Maybe it's changed. I'm going back a long way, but part of my view on this was shaped by an internal white paper written by an engineer who analyzed the cost of building a Gmail clone using commodity tech vs Google's in-house approach; this was maybe circa 2010. He didn't even look at people costs, just hardware, and the commodity tech stack smoked Gmail's on cost without much difference in features (the analysis focused on storage and serving, not spam filtering, where there was no comparably good commodity solution).
The cost delta was massive and really quite astounding to see spelled out, because it was hardly talked about internally even after the paper was written. And if you took into account the very high comp Google engineers got, even back then when it was lower than today, the delta became comic. If Gmail had been a normal business it would have been outcompeted on price and gone broke instantly, the cost disadvantage was so huge.
The people who built Gmail were far from dumb, but they just weren't being measured on cost efficiency at all. The same issues could be seen at all levels of the Google stack at that time. For instance, one reason for Gmail's cost problem was that the underlying shared storage systems, like replicated BigTables, were very expensive compared to ordinary SANs. And Google's insistence on being able to take clusters offline at will, with very little notice, required a higher replication factor than a normal company would have used. There were certainly benefits in terms of rapid iteration on advanced datacenter tech, but did every product really need such advanced datacenters to begin with? Probably not. The products I worked on didn't seem to.
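To make the replication point concrete, here's a toy comparison; all figures are invented, not Google's actual numbers:

    # How the replication factor inflates raw storage cost.
    # All figures are hypothetical, for illustration only.

    LOGICAL_PB = 10            # logical data stored, petabytes (hypothetical)
    COST_PER_PB = 100_000      # $ per petabyte of raw disk (hypothetical)

    def raw_cost(logical_pb: float, overhead: float) -> float:
        """overhead = raw bytes required per logical byte."""
        return logical_pb * overhead * COST_PER_PB

    # GFS/BigTable-style 3x full replication, which is what lets you
    # drain whole clusters on short notice:
    print(f"3.0x replication: ${raw_cost(LOGICAL_PB, 3.0):,.0f}")
    # RAID-6-style SAN at ~1.3x overhead, the "normal company" baseline:
    print(f"1.3x RAID-style:  ${raw_cost(LOGICAL_PB, 1.3):,.0f}")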
Occasionally we'd get a reality check when acquiring companies and discovering they ran competitive products on what was for Google an unimaginably thrifty budget.
So Google was certainly willing to scale up things that only made financial sense in an environment totally unconstrained by normal budgets. Perhaps the hardware divisions operate differently, but it was true of the software side at least.
The issue isn't the number of designs but architectural stability. NVIDIA's chips have been general purpose for a long time. They get faster and more powerful, but CUDA has always been able to run any kind of neural network. TPUs used to be over-specialized to specific NN types and couldn't handle even quite small evolutions in algorithm design, whereas NVIDIA cards could. As a consequence, Google has used a lot of GPU hardware too.
At the same time, if the TPU didn't exist, NVIDIA would have pretty much a complete monopoly on the market.
While Nvidia does have an unlimited money printer at the moment, the fact that at least some potential future competition exists does represent a threat to that.
Depending on how you count, the parent comment is accurate. Hardware doesn't just appear; four years of planning and R&D for the first-generation chip is probably right.
I was wrong, ironically, because Google's AI overview says it's 15 years if you search. The article it's quoting appears to count the creation of TensorFlow as an "origin".