So are the electric and cooling costs at Google's scale. Improving perf-per-watt efficiency can pay for itself. The fact that they keep iterating on it suggests it's not a negative-return exercise.
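To make "can pay for itself" concrete, here's a rough back-of-the-envelope sketch. Every figure in it (fleet size, watts saved per chip, PUE, electricity price) is an assumption for illustration, not anything Google has published:

```python
# Illustrative back-of-the-envelope only; all figures below are assumptions.
FLEET_SIZE = 500_000          # accelerators deployed (assumed)
WATTS_SAVED_PER_CHIP = 100    # perf-per-watt gain expressed as watts saved at equal work (assumed)
PUE = 1.1                     # datacenter overhead multiplier for cooling/power delivery (assumed)
PRICE_PER_KWH = 0.05          # industrial electricity price, $/kWh (assumed)
HOURS_PER_YEAR = 24 * 365

annual_savings = (
    FLEET_SIZE * WATTS_SAVED_PER_CHIP * PUE * HOURS_PER_YEAR / 1000 * PRICE_PER_KWH
)
print(f"~${annual_savings / 1e6:.0f}M/year in electricity alone")  # ~$24M/year under these assumptions
```

Under those made-up numbers the power bill alone moves by tens of millions per year, before you count buying fewer chips or building fewer datacenters, so the efficiency argument isn't crazy even if the exact figures are unknowable from outside.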
TPUs probably can pay for themselves, especially given NVIDIA's huge margins. But it doesn't follow that they do just because Google funds them. When I worked there, Google routinely funded all kinds of things without even the foggiest idea of whether they were profitable or not. There was just a really strong philosophical commitment to doing everything in-house, no matter what.
> When I worked there Google routinely funded all kinds of things without even the foggiest idea of whether it was profitable or not.
You're talking about small-money bets. The technical infrastructure group at Google makes a lot of them, to explore options or hedge risks, but they only scale the things that make financial sense. They aren't dumb people after all.
The TPU was a small-money bet for quite a few years until this latest AI boom.
Maybe it's changed. I'm going back a long way, but part of my view on this was shaped by an internal white paper written by an engineer who analyzed the cost of building a Gmail clone using commodity tech vs Google's in-house approach, circa 2010 or so. He didn't even look at people costs, just hardware, and the commodity tech stack smoked Gmail's on cost without much difference in features (this was focused on storage and serving, not spam filtering, where there was no comparably good commodity solution).
The cost delta was massive, and genuinely astounding to see spelled out, because it was hardly talked about internally even after the paper was written. And if you took into account the very high comp Google engineers got, even back then when it was lower than today, the delta became comical. If Gmail had been a normal business it would have been outcompeted on price and gone broke instantly; the cost disadvantage was that huge.
The people who built Gmail were far from dumb but they just weren't being measured on cost efficiency at all. The same issues could be seen at all levels of the Google stack at that time. For instance, one reason for Gmail's cost problem was that the underlying shared storage systems like replicated BigTables were very expensive compared to more ordinary SANs. And Google's insistence on being able to take clusters offline at will with very little notice required a higher replication factor than a normal company would have used. There were certainly benefits in terms of rapid iteration on advanced datacenter tech, but did every product really need such advanced datacenters to begin with? Probably not. The products I worked on didn't seem to.
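To put rough numbers on how quickly replication factor plus drain headroom inflates the disk bill, here's a toy sketch. The overhead and utilization figures are assumptions for illustration, not actual Gmail or BigTable numbers:

```python
# Rough illustration of why replication factor dominates storage cost; numbers are assumptions.

def raw_bytes_per_usable_byte(replication_factor: float, utilization: float) -> float:
    """Raw disk bytes needed for each byte of user data."""
    return replication_factor / utilization

# Commodity SAN: RAID-6-style parity overhead (~1.25x), run at 80% utilization (assumed).
san = raw_bytes_per_usable_byte(replication_factor=1.25, utilization=0.80)

# Google-style: 3 full replicas, plus headroom so whole clusters can be drained
# at short notice -- say only 60% of capacity is usable at any time (assumed).
replicated = raw_bytes_per_usable_byte(replication_factor=3.0, utilization=0.60)

print(f"SAN: {san:.1f}x raw bytes; replicated setup: {replicated:.1f}x raw bytes")
print(f"=> roughly {replicated / san:.1f}x more disk for the same user data")
```

Even with generous assumptions the replicated setup needs something like 3x the raw disk of the SAN for the same logical data, before you price in the engineers running it, which is the kind of gap the white paper spelled out.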
Occasionally we'd get a reality check when acquiring companies and discovering they ran competitive products on what was for Google an unimaginably thrifty budget.
So Google was certainly willing to scale up things that only made financial sense in an environment totally unconstrained by normal budgets. Perhaps the hardware divisions operate differently, but it was true of the software side at least.