
It would be unfortunate if future process improvements resulted in fragile CPUs and GPUs. I can sort of imagine Nvidia rubbing their hands with glee at the prospect of non-overclocked GPUs aging prematurely, killing used sales and forcing data centers to upgrade to more expensive compute cards.



I think professional-tier hardware generally comes with some sort of guarantee on how long it lasts, so data centers shouldn't be affected much.


There is a market for used cards. It is one way to get a relatively slow but high RAM compute card without paying extreme prices. Killing that market would force a lot of non-corporate users to start coughing up for extremely expensive new hardware.


That's correct. And it makes sense. If you're buying HW in bulk, you take into account how much power it will draw, what the expected life expectancy is, and how much work a given part will do, not just the price.

If you get an amazing price for a part that fails often, it might cost you way more in the long run.


In general I would expect a bulk purchaser to be less sensitive to failure.

If I'm buying one drive or CPU, I might pay a premium to drop the failure rate from 4% to 1%. If I'm buying dozens to hook together in a fault-tolerant system, I'll go for the cheap one and buy a few extras.


When you buy 1000 CPUs with a failure rate of 4% per year, that gives you statistically 40 dead servers each year for 5 years (the average guaranteed component life). That's 200 dead CPUs.

Let's say $1,000 per CPU: that's $1,000,000 in purchase cost and $200,000 lost to failures.

At 1%: 50 dead CPUs in 5 years, $50,000 in losses.

In this scenario you have $150,000 to save or spend to get better equipment.

Also, an important note: on top of that you suffer downtime losses and the manpower cost of taking the server out and swapping parts. If your e-commerce site goes offline, that might cause significant monetary loss.

For any large-scale purchase it's all a numbers game.

In the case of single purchases, paying extra to get from 4% to 1% seems excessive (depending on the cost increase).

At these percentage levels it's a roll of the dice whether it dies or not.
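A quick sketch of the arithmetic above, treating the 4% and 1% figures as per-year rates over the 5-year service life:

    CPUS = 1000
    PRICE = 1000        # $ per CPU
    YEARS = 5           # average guaranteed component life

    def expected_loss(annual_failure_rate):
        # Expected replacement cost over the service life, assuming the
        # same fraction of the fleet fails each year.
        failed = CPUS * annual_failure_rate * YEARS
        return failed * PRICE

    print(expected_loss(0.04))  # 200000.0 -> $200,000 in losses
    print(expected_loss(0.01))  # 50000.0  -> $50,000 in losses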


That wasn't supposed to be a per-year failure rate. 20% per 5 years is a crazy amount. Divide the numbers by 5 to get a per-year failure rate closer to my intent.

So if it's $850 for the 4% failure chip, and $1000 for the 1% failure chip, I'll probably buy the cheaper one in bulk. $850k upfront and $34k in replacement, vs. $1000k upfront and $10k in replacement.

There's extra manpower, sure, but even if it costs a hundred dollars of labor per replacement the numbers barely budge.
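Rough numbers in code form, with the hundred dollars of labor per swap as a placeholder and the 4%/1% treated as total failure rates over the part's life:

    CHIPS = 1000
    LABOR = 100  # $ of labor per replacement, placeholder figure

    def total_cost(unit_price, lifetime_failure_rate):
        # Upfront purchase plus replacement parts and swap labor.
        failures = CHIPS * lifetime_failure_rate
        return CHIPS * unit_price + failures * (unit_price + LABOR)

    print(total_cost(850, 0.04))   # 888000.0  -> cheap chip, ~$888k total
    print(total_cost(1000, 0.01))  # 1011000.0 -> reliable chip, ~$1,011k total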

> downtime losses

In a big system you shouldn't have those from a single server failure! Downtime losses are the point I was making. As someone buying just one drive or chip, failure costs me massive amounts beyond the part itself. But if I can buy enough for redundancy, then those problems become much much smaller.

If you're running things off one server, then apply the single device analysis, not the bulk analysis.


Most of the pro grade hardware does come with those assurances, usually with support and replacement contracts as well. And make no mistake, they may replace things a lot -- at my old gig we had a lot of visits from Dell. To Dell's credit, we had a LOT of gear to cover, in several different colo spaces.

Point is though, they price the cost of replacement into those guarantees. Doesn't mean the hardware will last longer, just that support & replacements are already baked into what you pay.


Let's say they could get away with it.

What would be a reasonable time span for this?

Maybe 5 years? I have a GTX 970 in my PC which is 5 years old by now. While the card is fine by itself, it is too slow and is thus getting replaced in the near future and moved into an office PC.

But which data center uses 5 year old graphics cards?

It is safe to assume that a dedicated compute card gets replaced from time to time anyway.

Moore's law is still too strong.


I'm typing this from a ~6 year old laptop (used as a desktop OFC) for example. My phone is ~5 years old (unbelievable, I know). When it fails in a few years I'll happily buy a "new" ~4 year old refurbished phone again.

Regarding Moore's law, there's only so many possible shrinks left to go. Once we hit that wall the incentive to be on the latest node is significantly reduced. Combine that with associated lifetime reductions and I think larger nodes might even end up preferable in many cases.


Don't forget preservation efforts. I know most people don't care about it, but many enthusiasts like giving old systems a spin every now and then. You can still build your dream 386 from used parts off of eBay and play Wing Commander on a CRT. For the upcoming generation of hardware, that might just be impossible.


AWS still has the m1 EC2 instance type on their pricing page which was first launched in... 2006.

Datacenter hardware can stick around for a looooong time if the people running the applications on top don't feel like migrating to new hardware.


What makes you think that instance type is still running on the same hardware it was in 2006?


That they encourage folks to migrate off them and have a much more limited supply of the older instance families seems to imply it.


A 5-10 year old machine can still be perfectly usable. I have a 2012 laptop with a high-end i7 3-series CPU and a high-end Quadro GPU, and I would hate it if either of them failed, because it's not something I can easily (if at all) fix; the whole laptop would become a doorstop.

An i7 6700 is already 5 years old. That's most certainly not an outdated "can throw away" CPU. Neither will a 3rd-gen Ryzen be, 4 years from now.


> What would be a reasonable time span for this?

The strong temperature dependence means this will likely look like "5 years with low fan noise and 30 years with high fan noise".


Since the wear is exponentially dependent on temperature, better cooling (e.g. water cooling) could extend the life of chips significantly, so if someone is worried about chip ageing from continuous use they can just install a water cooler, which is less pricey than an enterprise card.
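As a rough illustration of that exponential dependence, reliability models often use an Arrhenius-style acceleration factor; the activation energy below is purely illustrative:

    import math

    K_B_EV = 8.617e-5  # Boltzmann constant in eV/K

    def acceleration_factor(t_hot_c, t_cool_c, ea_ev=0.7):
        # How much faster wear accrues at t_hot_c than at t_cool_c (Celsius),
        # using an Arrhenius model with an illustrative 0.7 eV activation energy.
        t_hot = t_hot_c + 273.15
        t_cool = t_cool_c + 273.15
        return math.exp((ea_ev / K_B_EV) * (1.0 / t_cool - 1.0 / t_hot))

    print(acceleration_factor(85, 60))  # ~5.5x faster wear at 85 C vs 60 C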


Maybe industrial water cooling has a better track record, but consumer water cooling is an extremely fiddly and expensive process. My main source is Linus Tech Tips. Unless you're willing to spend a lot of money and effort, air cooling is more effective and much much cheaper. Simple (small) water cooling solutions tend to not perform better than air cooling. Plus, water cooling requires a lot of maintenance, because it's more complex (e.g. there's pumps that can fail) and because the cooling liquid can get contaminated and cause cooling performance to drop.

I can see how it could be cheaper to use cheap air cooling on the chips and efficient, central room cooling.


The day I had to top up my computer's water... I went back to air.


Spilled some coffee in my open computer case once. Wasn't running faster at all...

But seriously, I think it would be a viable solution for server farms, but it hasn't really caught on there yet. Probably still a matter of price. There are some theoretical applications with heat exchangers though. If we could recycle some of that heat, computing would be much more efficient in general.


I assume you're talking about centralized cooling for server farms? The solution I like for that is to turn the entire rear door of each cabinet into a water-fed heat exchanger, with no change to the servers. Then your piping is orders of magnitude simpler and safer.


Probably even better. Water has a nice heat capacity (I think about 10x as much as copper), but maybe that isn't that important for such a solution as long as the heat gets used. Even if we only got 10% of the invested energy back, it would already be a huge boon.


Heat capacity doesn't really matter unless you use the device for less time than it takes to warm up to steady state. If you have two materials with equal thermal conductivity but different heat capacities, their cooling performance will be the same once both reach steady state.
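A toy lumped-RC thermal model makes the point (the values are made up): heat capacity only sets how long the warm-up takes, while the steady-state temperature is set by power and thermal resistance.

    import math

    def die_temp(t_seconds, power_w=150.0, ambient_c=25.0,
                 r_thermal=0.3,     # K/W thermal resistance to ambient (made up)
                 c_thermal=400.0):  # J/K lumped heat capacity (made up)
        # Lumped RC model: c_thermal only sets the time constant tau;
        # the steady-state temperature is ambient + power * r_thermal.
        tau = r_thermal * c_thermal
        return ambient_c + power_w * r_thermal * (1.0 - math.exp(-t_seconds / tau))

    print(die_temp(10))      # ~28.6 C, still warming up
    print(die_temp(10_000))  # ~70 C at steady state, independent of c_thermal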


I think computers submerged in mineral oil were a good concept for this, other than the cleanup during maintenance.


One problem with any form of cooling is that you have to get the heat away from the silicon that's generating it and into the cooling system in the first place. For a lot of complex components (like a CPU), that's very hard to do, since there are dozens (or more, in the case of 3D circuits) of layers of heat-sensitive silicon and metal between any given component and the surface of the heat sink.


Good point, but not for laptops and devices though.



