While this is a great strategy for figuring out the cause of a bug, I’d argue that “root cause analysis” in engineering is typically a much more qualitative analysis, and more about high impact failures than mere bug reports.
A more accurate title may be “automatic data collection and analysis for bug reports”; I’m also confident that Microsoft has been doing this exact same thing for at least a few decades.
I did like the idea that they recorded low memory conditions. That seems incredibly useful for debugging issues that occur in the wild. A natural next step would be checking GPU memory as well if they haven't already.
Is it possible to measure overall system memory pressure in JS or is that sandboxed?
Seems the article is confusing the "trigger" of an event with its "root cause".
I like to give the example of the Concorde airplane crash [0] to exemplify the difference: the incident was triggered by debris on the runway (which caused the tires to explode, igniting the fuel tank above). But the root cause was the placement of fuel in proximity of inflatable tire materials.
The Concorde crash is complicated. That narrative leaves out a lot of details, notably:
The fuel tanks were filled above the maximum allowable amount.
The fuel tanks ruptured above the wing, not where the tire fragments hit. The fuel tanks ruptured because the shockwave from the tire fragments traveled through the fuel and into the top of the wing. This was a known issue and fixed by always having an air gap between the fuel and top of wings, but the plane was overfueled in this case.
The plane was over maximum takeoff weight, and taking off with the wind behind it.
The plane had/may have had an unrelated engine issue with engine #3, leading the flight engineer to shut down engine #3 shortly after takeoff.
The combination of overweight, overfueled, wrong wind direction and 3 out of 4 engines operational severely limited Concorde's ability to return to an airport. So I'm not sure you can blame the design decision of putting fuel in the wings (which literally every other relevant passenger aircraft does).
Exactly. "Complex systems almost always fail in complex ways." If a system is complex and yet stable enough that it's running in production, it's frequently running in a partially degraded state and at least partially resistant to single component failures. Noticeable failures reaching the level of root-cause analysis almost certainly involve multiple component failures, each with proximate and root causes. Recursively decompose those component failures if the components themselves are complex, and you're left with many "root" causes.
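To make the recursive decomposition concrete, here is a minimal Python sketch. The Failure class, the root_causes helper, and the outage tree are all made up for illustration (not taken from the article or any real incident); the point is only that walking a tree of contributing failures down to its leaves usually yields several "roots", not one.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Failure:
    """A failed component; `causes` holds the sub-failures that contributed to it."""
    name: str
    causes: list[Failure] = field(default_factory=list)


def root_causes(failure: Failure) -> list[str]:
    """Recursively decompose a failure and return the names of its leaf causes."""
    if not failure.causes:
        # Nothing further to decompose: treat this as a "root".
        return [failure.name]
    roots: list[str] = []
    for cause in failure.causes:
        for name in root_causes(cause):
            if name not in roots:  # keep first-seen order, drop repeats
                roots.append(name)
    return roots


# Hypothetical outage, invented purely for illustration.
outage = Failure("checkout outage", [
    Failure("database overloaded", [
        Failure("cache cluster degraded", [Failure("node lost after kernel upgrade")]),
        Failure("retry storm", [Failure("client retries lack backoff")]),
    ]),
    Failure("failover did not trigger", [Failure("health check too lenient")]),
])

for name in root_causes(outage):
    print(name)
```

One visible failure here decomposes into three independent leaves, and fixing any single one of them would not have removed the other two.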
To me a "trigger" is the initial event that begins a sequence. Isn't that also a "root cause"?
Since everything is connected to everything else, it seems like the point that you decide is the "root" is fairly arbitrary.
It seems you could easily conclude that the root cause was that the runway wasn't cleaned/inspected often enough. Or that the departures were scheduled too close together, preventing such an inspection, etc.
If anything I would say the root cause was the piece of engine cowl falling from the preceding flight - since that seems to be the first thing that "went wrong" in the process.
Debris on the runway is something to be expected in rare cases, given how many flights there are a day it's just a matter of time until it happens.
As such the problem is an airplane designed in a way that makes it too prone to fatal accidents in certain "rare but guaranteed to happen at some point" situations.
But in the end, whether you say both are triggers that together led to the catastrophe, or that one is a trigger and the other is the root cause, is indeed irrelevant.
The problem is that if you do something I will call trigger analysis, but refer to it as root cause analysis and treat it as if it gives you the root cause, it can very easily lead to situations where you fix one of the problems but not all, and potentially not even the biggest problem.
I.e. you make it slightly less likely that there is debris on the runway but you don't fix the problem of the airplane being too prone to certain kinds of catastrophic failure.
If I said 'screw you', and that caused you to flip out and kill everyone around you, you couldn't say the root cause was me saying 'screw you'. The root cause might be childhood trauma, extreme emotional imbalance, irrational thoughts, etc. The statement 'screw you' was the trigger.
Similarly here, a trigger might be uploading a photo to FB, but the root cause of an issue might be a bug in encoding JPGs.
One approach to this is the "Five Whys" approach, wherein you ask "why" five times. e.g.,
Q1: Why did the plane explode?
A1: The engine cowling fell into the engine
Q2: Why'd that happen?
A2: The tire exploded and damaged it.
etc etc.
Obviously the number 5 is arbitrary and not always applicable; it's just a heuristic to get to something "root-ish" without getting to ridiculously distant things like "the laws of physics prevent two objects from inhabiting the same position in space-time".
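As a rough sketch of the "ask why N times" heuristic in Python: the five_whys function and the example cause chain below are hypothetical (loosely modeled on the photo-upload example elsewhere in this thread, not on a real bug). The depth cap is what keeps the walk from running off into ever more distant causes.

```python
def five_whys(problem, why, max_depth=5):
    """Follow `why` answers starting from `problem`, for at most `max_depth` steps.

    `why` maps each statement to its immediate cause. The returned chain
    starts with the problem and ends at the deepest cause reached.
    """
    chain = [problem]
    # Stop when the depth budget is spent or no further "why" is known.
    while len(chain) - 1 < max_depth and chain[-1] in why:
        chain.append(why[chain[-1]])
    return chain


# Hypothetical cause chain, invented for illustration only.
why = {
    "photo upload fails": "JPEG encoder throws on the image",
    "JPEG encoder throws on the image": "encoder rejects an unusual color profile",
    "encoder rejects an unusual color profile": "profile parsing was never tested on such files",
}

for depth, statement in enumerate(five_whys("photo upload fails", why)):
    print(f"why x{depth}: {statement}")
```

Passing a smaller max_depth stops the walk earlier, which is the judgement call in code form: where you stop asking is where you declare the "root".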
More generally, defining something as root vs. not is somewhat of a judgement call. Usually you try to find something that will prevent future problems of this sort and call it the root cause. Ideally something that your organization can mitigate with a reasonable time/cost.
Note that the actual mitigation is a separate question. If runway debris is the root cause, then one mitigation is reworking the fuel system. Another would be using tougher tires. Perhaps another would be adding a shield between the tires and the aircraft body. Another might be an automated runway monitoring system that detects debris. etc.
A literal trigger is a lever that fires a gun. A figurative trigger is some event that sets in motion a cascade of events that occur in rapid succession. The root cause of firing a gun would probably be analogous to loading the chamber with a live round. Which may have occurred hours, days, months, etc before the trigger was pulled.
I think the distinction is that the "trigger" relates to a particular instance of failure, whereas the "root cause" relates to a class of failures.
> To me a "trigger" is the initial event that begins a sequence. Isn't that also a "root cause"?
no. root causes are irreducible, hence the word "root".
if someone is endlessly trained for an event and then fails at game time, it's probably a root cause. people make mistakes that cannot be avoided (this is why defense in depth is a thing)
if the training program is a 5 second sentence before the event, the person's mistake is not the root cause, it's the training program.
To me the placement of the tanks or lack of protection would be the root cause, because this failure could have been triggered by any other debris as well. So the root cause is the thing that, if you fix it, fixes the problem at a fundamental level.
So any concrete stats on how this has helped shorten bug investigation time, improve quality of releases, etc? It looks like an interesting data-driven approach to bug finding but there's curiously no qualitative analysis of how it's actually working in practice. I'd be a little concerned that systems like this can fall into the background as a flurry of noise and process that doesn't actually improve the quality of the product.