Are you referring to the "(100*drive-failures)/(drive-hours/24/365)"? There's no multiplication of total # of drives and # of failures in there.
It's all just a scaling: you have a number of broken drives in a corner of the datacenter, in the wire bucket that says "broke during 2015"; you count them, divide by the total hours that type of disk spent running (since they may have been brought into commission at different points), and then scale it so you get it in percent-per-year, not likelihood-per-hour.
It smells of someone explaining code, rather than illustrating an important engineering formula, but there's nothing wrong with the rescaling calculation per se.
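For what it's worth, here's that rescaling spelled out as a minimal sketch in Python (the function name and the sample numbers are mine, not the article's):

```python
def annualized_failure_rate(drive_failures, drive_hours):
    """(100 * drive-failures) / (drive-hours / 24 / 365): failures per
    drive-year of operation, expressed as a percentage."""
    drive_years = drive_hours / 24 / 365   # pooled running hours -> drive-years
    return 100 * drive_failures / drive_years

# 100 drives running all of 2015 = 100 * 24 * 365 = 876,000 drive-hours;
# 5 of them failing gives a 5.0% annualized failure rate.
print(annualized_failure_rate(5, 876_000))  # -> 5.0
```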
> Are you referring to the "(100*drive-failures)/(drive-hours/24/365)"? There's no multiplication of total # of drives and # of failures in there.
Perhaps the problem is the specific example given. 100 is the size of the drive fleet and also the multiplier required to convert to percentages. Let's assume you are right and the 100 in the equation is not #drives.
Even so, I find the approach questionable. If the point is to calculate the proportion of drives that failed then that (overly simplistic) calculation is failures / total drives, i.e. 5 / 100 = 5% for the example fleet.
But this isn't what's calculated. Instead the author calculates the proportion of drive-years per annum affected by failure. For the 100 drives in the example, the cumulative number of operational hours given for 2015 is 750K (out of a possible 876K, had every drive been running 100% of the time).
That's a problem, because the formula's denominator is then 750K / 24 / 365 ≈ 85.6 drive-years, i.e. only 85.6% of the 100 drive-years the fleet could have accumulated.
(100 * 5) / 85.6 ≈ 5.84% "failure rate", which seems to me an overstatement.
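To make that concrete, here's the example worked through under the article's formula (only a sketch; the variable names are mine):

```python
drive_failures = 5            # drives that failed during 2015
drive_hours = 750_000         # pooled operational hours actually logged

drive_years = drive_hours / 24 / 365        # ~85.6 drive-years, not 100
afr = 100 * drive_failures / drive_years    # ~5.84% "failure rate"

naive = 100 * drive_failures / 100          # 5.0% of the 100-drive fleet failed
print(afr, naive)                           # -> ~5.84, 5.0
```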
The problem gets worse as the number of operational hours decreases. Imagine for a moment that the 100 drives only operated 50% of the time in 2015. We have:
(100 * 5) / ((876K * 0.5) / 24 / 365) = 500 / 50 = 10% "failure rate". This despite only 5% of the drives having failed.
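And the 50%-uptime case under the same formula (again only a sketch, with my own variable names):

```python
drive_failures = 5
drive_hours = 876_000 * 0.5                 # 100 drives running only half of 2015

drive_years = drive_hours / 24 / 365        # 50 drive-years
afr = 100 * drive_failures / drive_years    # 10.0% "failure rate"
print(afr)                                  # -> 10.0, though only 5 of 100 drives failed
```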