It would be unfortunate if future process improvements resulted in fragile CPUs and GPUs. I can sort of imagine Nvidia rubbing their hands with glee at the prospect of non-overclocked GPUs aging prematurely, killing used sales and forcing data centers to upgrade to more expensive compute cards.
I think that professional-tier hardware generally comes with some sort of guarantee on how long it lasts, so data centers shouldn't be affected much.
There is a market for used cards. It is one way to get a relatively slow but high RAM compute card without paying extreme prices. Killing that market would force a lot of non-corporate users to start coughing up for extremely expensive new hardware.
That's correct, and it makes sense. If you're buying hardware in bulk, you take into account how much power it will draw, its expected lifetime, and how much work a given part will do, not just the price.
If you get an amazing price for a part that fails often, it might cost you way more in the long run.
In general I would expect a bulk purchaser to be less sensitive to failure.
If I'm buying one drive or CPU, I might pay a premium to drop the failure rate from 4% to 1%. If I'm buying dozens to hook together in a fault-tolerant system, I'll go for the cheap one and buy a few extras.
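A minimal sketch of the "buy a few extras" reasoning, assuming failures are independent and using made-up numbers (24 units needed, 4% failure chance each over the period). The function name and parameters are illustrative only:

```python
from math import comb


def prob_enough_survive(needed, spares, failure_rate):
    """Probability that at least `needed` of `needed + spares` units survive.

    Assumes each unit fails independently with probability `failure_rate`
    over the period of interest (a simplification; real failures can be
    correlated, e.g. a bad batch).
    """
    total = needed + spares
    # We are fine as long as the number of failures does not exceed the spares.
    return sum(
        comb(total, k) * failure_rate**k * (1 - failure_rate) ** (total - k)
        for k in range(spares + 1)
    )


# Hypothetical fleet: 24 units required, 4% failure chance per unit.
for spares in range(5):
    p = prob_enough_survive(24, spares, 0.04)
    print(f"{spares} spares: {p:.4f} chance of still having 24 working units")
```

With no spares the odds are poor, and each extra unit improves them sharply, which is why the cheaper part can win in bulk.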
Say you buy 1000 CPUs with a failure rate of 4% per year.
Statistically, that gives you 40 dead CPUs each year for 5 years (the average guaranteed component life). That's 200 dead CPUs.
Let's say $1000 per CPU: that's $1,000,000 in cost and $200,000 lost to failures.
At 1%: 50 dead CPUs in 5 years, $50,000 in losses.
In this scenario you have $150,000 to save or to spend on better equipment.
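The arithmetic above can be sketched as follows, assuming the fleet stays at 1000 units (dead parts get replaced), a flat per-year failure rate, and that each failure costs exactly one CPU. The function name and parameters are illustrative, not from any library:

```python
def expected_failure_loss(units, annual_failure_rate, years, unit_price):
    """Expected failed units and replacement cost over `years`.

    Simplification: the fleet size is constant, so the expected number
    of failures is the same every year.
    """
    failures = units * annual_failure_rate * years
    return failures, failures * unit_price


cheap_failures, cheap_loss = expected_failure_loss(1000, 0.04, 5, 1000)
good_failures, good_loss = expected_failure_loss(1000, 0.01, 5, 1000)

print(f"4%/year: {cheap_failures:.0f} failures, ${cheap_loss:,.0f} lost")
print(f"1%/year: {good_failures:.0f} failures, ${good_loss:,.0f} lost")
print(f"difference: ${cheap_loss - good_loss:,.0f}")
```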
Also, an important note: on top of that you suffer downtime losses and the manpower cost of taking a server out and swapping parts. If your e-commerce site goes offline, that might cause significant monetary loss.
For any large-scale purchase it's all a numbers game.
In the case of single purchases, paying extra to get from 4% to 1% seems excessive (depending on the cost increase).
At these percentage levels it's a roll of the dice whether it dies or not.
That wasn't supposed to be a per year failure rate. 20% per 5 years is a crazy amount. Divide the numbers by 5 to get a per-year failure closer to my intent.
So if it's $850 for the 4% failure chip, and $1000 for the 1% failure chip, I'll probably buy the cheaper one in bulk. $850k upfront and $34k in replacement, vs. $1000k upfront and $10k in replacement.
There's extra manpower, sure, but even if it costs a hundred dollars of labor per replacement the numbers barely budge.
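A quick sketch of that comparison, treating 4% and 1% as failure chances over the whole 5 years and assuming each failure costs one replacement part plus $100 of labor. Names and parameters here are illustrative:

```python
def total_cost(units, unit_price, failure_rate, labor_per_swap=0.0):
    """Upfront cost plus expected replacement parts and labor.

    `failure_rate` is the chance that a unit fails over the whole period
    (5 years in this example), not a per-year rate.
    """
    expected_failures = units * failure_rate
    upfront = units * unit_price
    replacements = expected_failures * (unit_price + labor_per_swap)
    return upfront + replacements


for labor in (0, 100):
    cheap = total_cost(1000, 850, 0.04, labor_per_swap=labor)
    premium = total_cost(1000, 1000, 0.01, labor_per_swap=labor)
    print(
        f"labor ${labor}/swap: cheap ${cheap:,.0f}, "
        f"premium ${premium:,.0f}, cheap saves ${premium - cheap:,.0f}"
    )
```

Even with the labor included, the gap between the two options shrinks only slightly.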
> downtime losses
In a big system you shouldn't have those from a single server failure! Downtime losses are the point I was making. As someone buying just one drive or chip, failure costs me massive amounts beyond the part itself. But if I can buy enough for redundancy, then those problems become much much smaller.
If you're running things off one server, then apply the single device analysis, not the bulk analysis.
Most of the pro grade hardware does come with those assurances, usually with support and replacement contracts as well. And make no mistake, they may replace things a lot -- at my old gig we had a lot of visits from Dell. To Dell's credit, we had a LOT of gear to cover, in several different colo spaces.
Point is though, they price the cost of replacement into those guarantees. Doesn't mean the hardware will last longer, just that support & replacements are
Maybe 5 years?
I have a GTX 970 in my PC which is 5 years old by now. While the card itself is fine, it is too slow, so it is getting replaced in the near future and moved into an office PC.
But which data center uses 5 year old graphics cards?
It is safe to assume that dedicated compute cards get replaced from time to time anyway.
I'm typing this from a ~6 year old laptop (used as a desktop OFC) for example. My phone is ~5 years old (unbelievable, I know). When it fails in a few years I'll happily buy a "new" ~4 year old refurbished phone again.
Regarding Moore's law, there's only so many possible shrinks left to go. Once we hit that wall the incentive to be on the latest node is significantly reduced. Combine that with associated lifetime reductions and I think larger nodes might even end up preferable in many cases.
Don't forget preservation efforts. I know most people don't care about it, but many enthusiasts like giving old systems a spin every now and then. You can still build your dream 386 from used parts off of eBay and play Wing Commander on a CRT. For the upcoming generation of hardware that might just be impossible.
A 5-10 year old machine can still be perfectly usable. I have a 2012 laptop with a high end i7 3-series CPU, high end Quadro GPU and I would hate if any of them failed because it's not something I can easily (if at all) fix, the whole laptop would become a doorstop.
An i7 6700 is already 5 years old. That's most certainly not an outdated "can throw away" CPU. Neither will a 3rd gen. Ryzen 4 years from now.
Since the wear is exponentially dependent on temperature, better cooling (e.g. water cooling) could extend the life of chips significantly, so if someone is worried about chip ageing from continuous use they can just install a water cooler, which is less pricey than an enterprise card.
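To put a rough number on "exponentially dependent", here is a sketch using the Arrhenius model, which is commonly used in reliability engineering. The model choice and the 0.7 eV activation energy are assumptions for illustration, not something stated in this thread; real wear mechanisms have different activation energies.

```python
from math import exp

BOLTZMANN_EV_PER_K = 8.617e-5  # Boltzmann constant in eV/K


def arrhenius_acceleration(temp_hot_c, temp_cool_c, activation_energy_ev=0.7):
    """How many times faster wear proceeds at temp_hot_c than at temp_cool_c.

    Arrhenius model: rate is proportional to exp(-Ea / (k * T)), T in kelvin.
    The default activation energy is an assumed, illustrative value.
    """
    t_hot = temp_hot_c + 273.15
    t_cool = temp_cool_c + 273.15
    exponent = activation_energy_ev / BOLTZMANN_EV_PER_K * (1 / t_cool - 1 / t_hot)
    return exp(exponent)


# Hypothetical example: chip running at 85 C on air vs 65 C with better cooling.
factor = arrhenius_acceleration(85, 65)
print(f"Wear at 85 C is about {factor:.1f}x faster than at 65 C")
```

Under these assumptions, a 20 degree drop slows wear by a factor of a few, so the effect of cooling on lifetime is far from linear.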
Maybe industrial water cooling has a better track record, but consumer water cooling is an extremely fiddly and expensive process. My main source is Linus Tech Tips. Unless you're willing to spend a lot of money and effort, air cooling is more effective and much much cheaper. Simple (small) water cooling solutions tend to not perform better than air cooling. Plus, water cooling requires a lot of maintenance, because it's more complex (e.g. there's pumps that can fail) and because the cooling liquid can get contaminated and cause cooling performance to drop.
I can see how it could be cheaper to use cheap air cooling on the chips and efficient, central room cooling.
Spilled some coffee in my open computer case once. Wasn't running faster at all...
But seriously, I think it would be a viable solution for server farms, but it hasn't really caught on there yet. Probably still a matter of price. There are some theoretical applications with heat exchangers though. If we could recycle some of that heat, computing would be much more efficient in general.
I assume you're talking about centralized cooling for server farms? The solution I like for that is to turn the entire rear door of each cabinet into a water-fed heat exchanger, with no change to the servers. Then your piping is orders of magnitude simpler and safer.
Probably even better. Water has a nice heat capacity (I think about 10x as much as copper, per unit mass), but maybe that isn't that important for such a solution as long as the heat gets used. Even if we got just 10% of the invested energy back, it would already be a huge boon.
Heat capacity doesn't really matter, unless you'll be using the device less time than it takes to reach that capacity. If you have two materials with equal thermal conductivity, but different heat capacity, their cooling properties will be the same once both reach their heat capacity.
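A minimal lumped-model sketch of this point, assuming a single block heated at constant power and cooled to ambient through a fixed thermal resistance. All numbers are hypothetical, and this ignores flowing coolant, where heat capacity does affect how much heat each litre carries away:

```python
from math import exp


def block_temperature(t_seconds, power_w, resistance_k_per_w,
                      capacity_j_per_k, ambient_c=25.0):
    """Temperature of a lumped block after t_seconds of constant heating.

    First-order model: rise = P * R * (1 - exp(-t / (R * C))).
    The steady-state rise P * R does not depend on the capacity C.
    """
    tau = resistance_k_per_w * capacity_j_per_k  # thermal time constant, s
    rise = power_w * resistance_k_per_w * (1 - exp(-t_seconds / tau))
    return ambient_c + rise


# Hypothetical: same 100 W load and 0.3 K/W path, 10x difference in capacity.
for t in (10, 100, 1000, 10000):
    small = block_temperature(t, 100, 0.3, 200)
    large = block_temperature(t, 100, 0.3, 2000)
    print(f"t={t:>5}s  small capacity: {small:5.1f} C  large capacity: {large:5.1f} C")
```

The larger capacity only delays the warm-up; given enough time both blocks settle at the same temperature.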
One problem with any form of cooling is that you have to get the heat away from the silicon that's generating it and into the cooling system in the first place. For a lot of complex components (like a CPU), that's very hard to do, since there's dozens (or more, in the case of 3d circuits) of layers of heat-sensitive silicon and metal between any given component and the surface of the heat sink.