An Empirical Analysis of Hardware Failures on a Million Consumer PCs (microsoft.com)
150 points by mbafk on June 26, 2012 | 78 comments



Very useful -- I will take this analysis into account when it's time to upgrade my current personal machine or configure the next one! Thank you for posting this here.

The only thing I would have wanted to see but didn't in this analysis is how failure rates vary for different types of disk subsystem -- specifically, traditional hard drives versus the newer solid-state devices. I suspect, but don't know for sure, that the latter have much, much lower real-world failure rates in the first 30 days of total accumulated CPU time (TACT).

The authors openly suggest that the sharp difference in failure rates between desktop and laptop machines may be due in part to their disk subsystems: "Laptops are between 25% and 60% less likely than desktop machines to crash from a hardware fault over the first 30 days of observed TACT. We hypothesize that the durability features built into laptops (such as motion-robust hard drives) make these machines more robust to failures in general." Alas, the authors don't delve any further into it.

I'd like to see hard data comparing the real-world failure rates of both desktops and laptops using traditional versus solid-state disk subsystems.


So far, the numbers I've seen suggest that SSD failure rates are in the same ballpark as spinning rust, but it's only been slightly over a year that most SSDs have actually been reliable. Many, many older models were absolutely terrible. Hence I think it may still be difficult to draw reliable conclusions.



wazoox: thanks. Do you recall the source(s) for those numbers?


So far, the best source I can recall is this study: http://www.tomshardware.com/reviews/ssd-reliability-failure-...


That study is Intel-only, but it seems to jibe with the return rates AnandTech mentioned: http://www.anandtech.com/show/4202/the-intel-ssd-510-review/...

There are also numbers for hard drives: http://forums.anandtech.com/showthread.php?t=2147063

Basically, Intel SSDs from a year ago are more reliable than all hard drives. And SSDs in general are more reliable than any 2TB hard drive.

The data isn't ideal, but it's better than anecdotes. Return rates should correlate pretty well with failure rates. If anything, return rates should favor hard drives, since people are less likely to return a faulty cheap hard drive than a faulty expensive SSD.


AngryParsley: the Tom's Hardware article wazoox mentioned above also has those return-rate stats (page 3): "...returns can occur for a multitude of reasons. This presents a challenge because we don’t have any additional information on the returned drives—were they dead-on-arrival, did they stop working over time, or was there simply an incompatibility that prevented the customer from using the [device]? ... If online purchases account for the majority of hard drives sold, poor packaging and carrier mishandling can have a real effect on return rates. Furthermore, we also have no way of normalizing how customers used these drives. The large variance in hard drive return rates [between data sets] underlines this problem. For example, the Seagate Barracuda LP rises from 2.1% to 4.1%, while the Western Digital Caviar Green WD10EARS drops from 2.4% to 1.2%..."[1]

In short, the available return-rate data is too noisy and inconsistent to be a good proxy for failure rates.

[1] http://www.tomshardware.com/reviews/ssd-reliability-failure-...


wazoox: thanks for that. Just read it.

My take: no one has sufficient consistent, across-the-board data at the moment to reach a conclusion about the matter, but the anecdotal evidence presented in that article suggests that Intel SSDs probably have lower failure rates than most traditional and solid-state alternatives. I will keep that in mind next time I buy an SSD.


Since the database they used is from "a period of 8 months in 2008" (Section 4, "Measuring hardware failure rates") I doubt they had any significant number of solid-state disks in their data.


cs702, this is the article I looked at before buying my SSD recently: http://www.tomshardware.com/reviews/ssd-reliability-failure-...

It's talking specifically about disk failures, though, not comparing whole systems.


thechut: thanks. See my response here: http://news.ycombinator.com/item?id=4162362


Funny, I suspect that SSDs have much higher real-world failure rates. (My personal, limited, anecdotal evidence is that my 64 GB Crucial M4 SSD lasted about a year as the root drive in a busy Linux desktop, while I have a stack of about a dozen hard drives that have been retired due to being too small or too slow while still working fine.)

Lack of moving parts is great, but flash endures only a finite number of write cycles.


How heavily were you using the drive? Did it fail from running out of write cycles or something else?

Some people over on XtremeSystems have done endurance testing, and the 64 GB m4 took over 700 TB of writes before failure occurred, and 172 TB to reduce the MWI (media wearout indicator) to 0.

In a little over a year I have only written 4.1TB to my SSD in my desktop. Write cycles are very unlikely to run out for me before I replace the drive.

http://www.xtremesystems.org/forums/showthread.php?271063-SS...
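
For scale, a rough back-of-envelope sketch using only the figures above (the 172 TB and 700 TB endurance numbers are from that thread, and 4.1 TB/year is my own usage; nothing here is from the paper):

    # How many years of writing at my rate before hitting the endurance
    # figures quoted above. All inputs are as stated, nothing measured here.
    tb_per_year   = 4.1    # written to my SSD in a bit over a year
    tb_to_mwi_0   = 172    # thread's figure: wear indicator reaches 0
    tb_to_failure = 700    # thread's figure: drive actually fails

    print(tb_to_mwi_0 / tb_per_year)    # ~42 years to MWI 0
    print(tb_to_failure / tb_per_year)  # ~171 years to outright failure

Obviously real workloads and write amplification vary, but the margin is enormous.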


I don't have actual numbers, but it's my primary desktop at home, and it saw everyday use.


dripton: I hear you, but note that I'm not interested in write-cycle limits or MTBF stats obtained in a lab setting; I'd just like to know the real-world rate of hardware failure of SSDs.


I think everyone would. Unfortunately there are financial reasons for SSD manufacturers to keep that information secret and contractual reasons why retailers cannot release it.

I think Intel released some information on the reliability of their SSDs a few years ago, but that was likely because they knew they were doing the best, and their enterprise customers are very interested in that for their data-centre rollouts.

The very limited information I've seen suggests to me that a few years ago SSDs had a much higher failure rate than HDDs (double in the first 6 months) but that has been falling very quickly with each new generation of SSDs (and as the profit margin grows and manufacturers have to work on reputation to justify the markup).


When Microsoft, Google, or some university publishes an analysis of hardware failures across large numbers of machines, they always anonymize hardware vendors ("vendor A", "vendor B").

I understand the reasons (not alienating your hardware vendors), but will there ever be a research group who will disclose vendor names? Heck, I would pay for this information.


IMO such information would do more harm than good. By the time they could gather statistics, that model would be obsolete so the stats wouldn't help you buy new equipment. But the kind of people who are still griping about the Deathstar would use the information to troll non-stop.


BUT if we saw consistently poor performance from some vendor in a certain category, we could infer that their future offerings in that category will also be poor.


This is dated and only applies to laptops, but it breaks down failures by manufacturer: http://www.electronicsweekly.com/blogs/engineering-design-pr...


Among other interesting insights:

* a machine that crashed once is 100 times more likely to crash again; the more it crashes, the more it's prone to fail again.

* overclocking significantly reduces reliability. One CPU vendor (AMD or Intel, but unspecified) is much worse in this regard, too.

* conversely, underclocking improves reliability.

* branded computers are more reliable than beige boxes.

* laptops are more reliable than desktops.


>> branded computers are more reliable than beige boxes.

This is meaningless.

The real question is how those compare to a beige box that uses decent parts. But Microsoft definitely has an interest in helping manufacturers of brand-name computers, because piracy is more prevalent on beige boxes.


This paper was written by researchers at Microsoft Research and accepted at an academic conference. It is not marketing. To be clear, you are suggesting that the researchers were dishonest in order to help their company. I find this unlikely.

Disclaimer: I am a researcher in a corporate lab.


It doesn't have to be dishonesty. It might just be a bias (conscious or subconscious) toward not wasting time on this question.

Even doctors show this kind of bias when advising people on choosing treatments (which is a much bigger moral issue).


Their conclusions on "white boxes" are based on relatively straightforward statistical analysis of their data. In order for there to be bias against white boxes, one of the following has to be true:

1. Their data collection methods are biased against white boxes. Given the large sample size and the method of retrieving samples - automatically generated crash reports from users - I find this unlikely. They cover this point in section 3.2.1.

2. Their statistical analysis is flawed. I see no issues with it, nor did the reviewers. (Otherwise it wouldn't have been accepted.)

3. They lied. I am most skeptical on this one.

It's disingenuous to gesture at researchers and allege bias based on their employer without actually saying how they are biased. Doing so is not valid skepticism, but prejudice.


My suspicion is that branded hardware manufacturers are uniformly reasonably good in quality. White box vendors may vary widely: some are good, but some (many? few?) are really, really bad. This can skew data.

It's also possible that it's getting more difficult for small vendors to accurately spec systems, to enforce supplier quality (if Dell gets a bad batch of drives, they can 1) detect it and 2) tell the vendor to stuff it; Ahmed's Boxez'R'Us may not have that leverage or depth of experience), and to do burn-in testing of their own systems.

That said, I've had good and bad experiences with big-name and white box vendors alike.


One question is what happens to a component that fails a major OEM's QC standards: where does it go? Into the garbage? Or into the white-box channel?

For example, some have speculated that "Gamer RAM" with mean looking heatsinks is actually poorer quality stuff that requires additional cooling to work correctly.


The problem is that they didn't write in their paper: "White boxes are known to have large variability in their level of reliability. We leave it to further study to compare the reliability of white boxes built with quality components to brand-name boxes."

Do you think they didn't know that?

This kind of remark about the incompleteness of the research done is common in many papers, and it definitely helps readers avoid getting the wrong impression.


The problem with your statement is that it's based on anecdote. From their related work section: "The effect on failure rates of overclocking and underclocking, brand name vs. white box, and memory size have not been previously studied." You assume that white boxes have a large variability in reliability based on your personal experience and anecdote. But, according to the authors, there is no systematic study backing that up. "Quality components" is similarly difficult to pin down.

The authors stated a conclusion, but did not speculate on the cause behind the conclusion. I see no bias in them not calling attention to the fact that they have not studied the cause - that is self-evident.


Perhaps the difference is in testing.

A large mfg, I would imagine, would test a configuration repeatedly before making it available, and then, once approved, the individual systems would go through burn-in, probably with more rigor than beige boxes. So even beige boxes with pricier (but unproven) configurations might suffer from greater failure rates. In addition, large mfgs might be able to demand better "lots" from their parts mfgs/OEMs.

Just a thought.


Dell tests nothing. Parts in one door, assembly, shipped out other door to consumer.

Essentially you, the consumer, are doing the burn-in. It's cheaper for Dell to replace failed machines. The cost of burn-in (and the time!) is large.


I was wrong! A former Dell employee tells of touring a plant and seeing the test station - hydra-like cables dangling from the ceiling with 1 plug for every hole in the computer. It would get plugged in, network-boot diagnostics and run for some time before being passed. But this was 12 years ago...


Would you be happier with it if it said "a randomly-chosen branded computer is, on average, likely to be more reliable than a randomly-chosen beige box"?


Taking a stab in the dark here, but I wonder how much of a factor users messing with their boxes is.

I assume more "beige boxes" are either put together from cheap components and assembled by the users themselves, or are generally cheap no-name buys assembled by who-knows-whom in the shop, or have been modified and/or overclocked by enthusiasts, making them more likely to fail, whereas the typical users who buy "brand name" PCs or laptops are not going to mess with them and can rely on at least SOME standardized quality assurance and control.


I agree with you that many factors need to be considered here that aren't mentioned in the paper. But they do mention in section 5.3, "Brand name vs. white box", that "to avoid conflation with other factors, we remove overclocked machines and laptops from our analysis."


I'm having a hard time coming to terms with "Laptops less likely to crash from hardware fault than desktops"

Everything we've learned from experience, surveys, and PC World magazines has shown the opposite. Heat kills hardware, and laptops have their hardware packed together so closely that it generates lots of heat. Back then I remember reading something like 1 in 4 laptops fail in the first 3 years, which was very believable; at the time I was in college for game design & development. All 80 guys in our class had laptops from HP (with, get this... Pentium 4s in them). Those laptops had a LOT of problems. They were basically portable heaters.

So I guess laptops now have either much better cooling, much cooler CPUs or a combination. OR PCs are just terribly cooled.


This analysis is making a very specific measurement. They're only counting software crashes that are detected and reported in Windows logs.

If a machine crashes so severely that a crash report is not generated, then those reports will not be present in our data. Therefore, our analysis can be considered conservative...

"Conservative" here means "it underestimates the crash rate by some unknowable amount."

If you drop your laptop down a flight of stairs, it will likely develop some hardware problems. But for one thing, not all hardware failures will cause software crashes and crash reports. For another thing, you're likely to just replace the drop-kicked laptop, and the hardware failures will never appear in logs like these.


Here is the authors' hypothesis:

We hypothesize that the durability features built into laptops (such as motion-robust hard drives) make these machines more robust to failures in general. Perhaps the physical environment of a home or office is not much gentler to a computer as the difference in engineering warrants.


Just a theory, but perhaps laptops have a shorter lifespan in general?


No. They only took into account the machines' first 30 days of service life:

"we only count failures within the first 30 days of TACT (total accumulated computing time), for machines with at least 30 days of TACT."


That's an interesting point. Laptops get replaced faster, and when a laptop is replaced, it is typically left in a non-running state in which it will not experience or report errors. Desktops are often left running for months or years after they've been replaced, either through laziness or because of a desire to keep them around for a special purpose.


Interesting stuff. You can improve reliability by running your system at a lower speed. Here's a blog post with a summary of some of the conclusions of the paper above: http://grano.la/blog/2012/06/improve-the-reliability-of-your... (Disclaimer: that's my company's blog)

One question I still have is whether the switching of CPU frequencies has any effect, or if it is only the average speed that correlates to the reliability. Anecdotal evidence suggests that this is the case, but it could be an area for further research.


Interesting, too bad the power supplies could not be controlled in their setup, as a wonky power supply can unleash all kinds of gremlins that look like failures in components down the line.


I have been telling my staff for probably 15 or so years that the likeliest causes of PC failure are (in the following order):

- Power supply
- Hard drive

Ranking near these are sleeve-bearing fans (with ball-bearing fans a close second) for CPU and case cooling.

In rough order, I would say that the following is my estimation of other common component failure sources:

- Removable drives (floppy, optical, etc.)
- Video card (if separate)
- Motherboard
- RAM
- CPU

These are for our corporate PCs which have been Compaq, Dell, IBM, HP, Lenovo and a few other brands.

For our brand name and whitebox server hardware, it's pretty much the same... if something is going to fail, it's going to be a power supply or a hard drive. In fact, I don't remember a single server motherboard, RAID controller, RAM stick, CPU or other component ever going bad in a server.

I wonder why they would leave out statistics relating to power supplies when they are, in my experience, the component with the greatest failure rate.


Most of my desktop/server computer failures or abnormal behavior are due to power supplies not working properly, even when protected by a decent UPS.


Interesting read, though: why can't Microsoft just tell me that my CPU or HD or memory is failing and suggest I RMA it, instead of asking every time whether I've applied the latest updates, which I just get to click away as unhelpful?

The most important thing I have found for PC reliability, above everything else, is a good PSU; it really does make a difference on the hardware side, as you give your kit cleaner power. Add a UPS/surge protector and you can double the lifetime of your kit. At least in my experience the difference has been noticeable.


The copy at Microsoft Research is offline.

Here's another copy of the paper:

http://eurosys2011.cs.uni-salzburg.at/pdf/eurosys2011-nighti...


Mirror. (For people who tried to ctrl-f and found nothing, like me.)


There are a lot of insights in the paper, but I'd really like to know about this:

"The table shows that CPUs from Vendor A are nearly 20x as likely to crash a machine during the 8 month observation period when they are overclocked, and CPUs from Vendor B are over 4x as likely"

Obviously that's a 5x difference between Intel and AMD in the probability of ending up with an unstable system when overclocked, but they don't say which one is better. Does anybody know?


As a stab in the dark, I'd say that it's Intel's chips that perform more stably while overclocked, all else equal, because Intel comfortably holds the performance crown, and is thus very conservative in their binning.

Right now, Intel's fastest desktop chip is an i7 990X, which is $1,029 on Newegg. AMD's is an FX-8150, at $199.

Intel prices pretty fairly against AMD on the price/performance curve where AMD has a competitor, e.g., the Core i5 3550 at $209 generally outperforms AMD's fastest chip. Pricing then soars off into the sky.

Which is to say, if AMD bumps the speed of their CPUs, they'll release a faster product, compete better against Intel (until Intel reacts), and make more money. If Intel bumps its speeds, they're competing against nobody but themselves, so they usually don't bother.

Therefore, you usually see Intel quite conservatively binning their chips, and they have a lot of headroom. It's not unusual to have an AMD chip that can't go 200 MHz faster on air cooling, and to have an Intel chip that can go well in excess of 1 GHz faster. So all else equal, bumping Intel chips is less likely to be an issue.

Now, there are some other factors at play here. Firstly, hardcore overclockers pretty much only buy Intel chips. Also, people that are serious often turn up the speed until right before the moment at which the chip starts getting SuperPI errors (i.e. errors at nearly maximum load).

But my totally uninformed gut feeling is that the majority of overclocks aren't done that seriously (if they are, it could actually reverse this analysis).

If we assume that, turning the knob up on an Intel chip without sophistication is much less likely to end badly. The crashes in their paper are not very frequent on average (months between), which isn't necessarily bad enough to revert to old CPU speeds even if the user knows that's what is going on.


When Google did a massive analysis of hard-drive failures, they also didn't publish manufacturer names, because they felt it would tarnish a company's name when the problem might just be one bad production run.


Actually, they were just keeping their cards close to their chest:

> However, in this paper, we do not show a breakdown of drives per manufacturer, model, or vintage due to the proprietary nature of these data.

http://research.google.com/archive/disk_failures.pdf


Yes, such statistics would have to be limited to the exact models analyzed; I can imagine the results turning out the other way around in the next generation of products.


The result most surprising to me is that laptops are between 25% and 60% less likely than desktop machines to crash from a hardware fault during the first 30 days worth of measurements.

The much larger weight and volume of desktops would seem to make them easier to cool.


That caught my eye too... perhaps it's because laptops are designed with being carried around in mind? Or, to draw on the other conclusions from the paper, because laptops generally have lower-frequency chips (and for that matter, slower memory and disks too)?


Laptops also in effect have a cleaner power source, with a UPS built in, so any power glitches get filtered as standard on a laptop, compared to a desktop.

At least, I have found desktops with a good-quality PSU and UPS noticeably more reliable than those without.


Laptops also in effect have a cleaner power source, with a UPS built in, so any power glitches get filtered as standard on a laptop, compared to a desktop.

I've heard this theory before but never seen it substantiated. Do you have a link to any research or articles that explains how this works?


How does it work? Laptops have a battery, hence they are less likely to power off suddenly when the mains voltage drops. And powering off suddenly increases the probability of some electronics in the system failing. No need for research, just simple electronics and logic.


Pull your laptop's power cord out of the wall and plug it back in. The laptop carries on unchanged. Dirty power supply, but the laptop is unaffected.


Interesting that they knew that CPU speed resulted in more crashes but didn't test this when it came to laptops vs desktops. If only they had released their data :(


Too bad CPU temperature was not part of the data collected for the study.


Is there a compelling reason for this to be a PDF rather than HTML? I'm genuinely curious.


As mbafk and josephturnip state, they simply put online the same copy that was published in the conference. Academic conferences typically publish papers in PDF form.

But, that doesn't actually answer your question, which I think is "Barely." I feel silly preparing PDFs for publication when I know that most people will read them on their computers, not print them out. Many conferences no longer even have an actual, physical copy of the proceedings, instead just giving out USB sticks with all of the PDFs. (Which is what we want anyway.)

I think it would be fantastic if there were a standard HTML5 template that researchers could use to publish their papers. There are LaTeX-to-HTML compilers, but I've never been impressed with the results. I think people outside of academia would be more likely to read our papers if they were in HTML rather than PDF.


I am outside academia but I read a lot of papers. For most people outside, I suspect the biggest problem is paywalls, not format. If you don't read a paper because it is a PDF and not HTML, then you weren't really interested. With Chrome, it is not even an annoyance like it was when you had to load Adobe Reader.

One reason for PDFs is, as you mentioned, that LaTeX-to-HTML results are typically poor. Diagrams are another difficulty without an easy HTML solution. Other reasons I prefer PDFs: though I never print, I often save papers to disk, since I don't always find the paper when I go searching for it a second time (especially if it is months or even years later); there is a real benefit to being able to read a paper offline - I don't always have a connection to the net when I want to read; and lastly, if you have an ereader such as the Kindle, PDFs render well on it.


Paywalls are probably a bigger issue, but it's still friction. Anyway, modern browsers give the option to save all of the images along with the HTML file.


I agree it would be great if there were a good HTML template the IEEE or ACM used. Even using LaTeX you spend ages making it look exactly the way you want, and you don't want to upload a dodgy HTML version.


The major difficulty, I think, is figures. How can we keep figures near the text that talks about them while also allowing a high density of them? The best I can come up with is a narrow column of text on the left (similar in size to a column of text in an ACM or IEEE double-column format), and a larger column on the right with figures. The difficulty is in anchoring the figures in the right column to the text. But then you may have a bunch of figures piled up in one place, and a sensible layout becomes difficult to do.


I guess they just uploaded the final version they published at EuroSys 2011. As at all ACM conferences I know about, PDF is the way it is done. They would have to reformat for HTML.


Most research papers are disseminated as PDF in my experience. This was a EuroSys 2011 paper I believe.


Mathematical formulas are a big reason; Wikipedia uses images, which look passable unless you start zooming in. Another reason is that with HTML you're never sure if you'll end up with the fonts as they were intended. Lastly, because HTML is not a page-oriented markup format, the presentation in general can differ from system to system; e.g., you can't say "on line 2 of page 3".


You may not have seen MathJax: http://www.mathjax.org/ The main drawback I noticed is a delay before it fills in the correct math notation. But the results are quite impressive.


I've always said that smart hardware tinkerers underclock. It produces less heat, and results in a quieter machine. I always suspected it improves reliability.


Or you could save money and just buy a lower bin.


Well, because heat dissipation is proportional to the square of the voltage, you end up giving up a little bit of performance but save a whole lot of heat. In experiential terms, you never miss performance but often notice a whole lot less fan noise.

Buying from a lower bin, you're getting a crappier processor, which might give you less latitude to save heat. This would probably be worth measuring and writing an article about. Also, I tend to buy lower clocked processors as it is.
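
To put rough numbers on that, here's a minimal sketch (the voltage and clock figures below are hypothetical, just to show the shape of the C*V^2*f scaling, not measurements of any real chip):

    # Dynamic CPU power scales roughly as C * V^2 * f; the constant C cancels
    # when comparing two operating points, so only the ratios matter.
    def relative_power(v, f, v0=1.25, f0=3.0):
        """Dynamic power relative to a (hypothetical) stock point."""
        return (v / v0) ** 2 * (f / f0)

    # Drop the clock 10% (3.0 -> 2.7 GHz) and undervolt from 1.25 V to 1.10 V:
    print(relative_power(1.10, 2.7))  # ~0.70, i.e. roughly 30% less heat
                                      # for only 10% less clock speed

The exact numbers depend on where a given chip sits on its voltage/frequency curve, so treat this as illustrative only.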


I think it's likely that all lower bins are artificial, so e.g. underclocking a 2.4 GHz chip down to 2.0 GHz is probably exactly the same as if you bought the 2.0. But yeah, it would be worth measuring.


Ah, I see. I wrote underclock. It's really undervolting that gets you the big win thermally. Underclocking should be done just as a means of achieving a greater undervolt. I just have these two things in the same mental bin.


When you underclock properly (with SpeedStep) it also lowers the voltage... probably to the same voltage that the lower-bin processor would use.


There's not just one voltage here. It's a curve. I suspect that the better the processor turned out, the more favorable your curve turns out to be.



