Minesweeper automates root cause analysis as a first-line defense against bugs

stingraycharles · on Feb 9, 2021

While this is a great strategy for figuring out the cause of a bug, I’d argue that “root cause analysis” in engineering is typically a much more qualitative analysis, and more about high impact failures than mere bug reports.

A more accurate title may be “automatic data collection and analysis for bug reports”; I’m also confident that Microsoft has been doing this exact same thing for at least a few decades.

tehjoker · on Feb 9, 2021

I did like the idea that they recorded low memory conditions. That seems incredibly useful for debugging issues that occur in the wild. A natural next step would be checking GPU memory as well if they haven't already.

Is it possible to measure overall system memory pressure in JS or is that sandboxed?

puttycat · on Feb 9, 2021

I’m pretty confident that they use the same OS data for fingerprinting as well.

hanniabu · on Feb 10, 2021

What special identifiers are there in the OS? And does this also apply to Linux/Ubuntu?

cat199 · on Feb 9, 2021

> and more about high impact failures than mere bug reports.

It's often more about people problems than software problems..

vladd · on Feb 9, 2021

Seems the article is confusing the "trigger" of an event with its "root cause".

I like to give the example of the Concorde airplane crash [0] to exemplify the difference: the incident was triggered by debris on the runway (which caused the tires to explode, igniting the fuel tank above). But the root cause was the placement of fuel in proximity of inflatable tire materials.

[0] https://en.wikipedia.org/wiki/Air_France_Flight_4590

pirocks · on Feb 10, 2021

The Concorde crash is complicated. That narrative leaves out a lot of details, notably: The fuel tanks were filled above the maximum allowable amount. The fuel tanks ruptured above the wing, not where the tire fragments hit. The fuel tanks ruptured because the shockwave from the tire fragments traveled through the fuel and into the top of the wing. This was a known issue and fixed by always having an air gap between the fuel and top of wings, but the plane was overfueled in this case. The plane was over maximum takeoff weight, and taking off with the wind behind it. The plane had/may have had an unrelated engine issue with engine #3, leading the flight engineer to shut down engine #3 shortly after takeoff.

The combination of overweight, overfueled, wrong wind direction and 3 out of 4 engines operational severely limited Concorde's ability to return to an airport. So I'm not sure you can blame the design decision of putting fuel in the wings (which literally every other relevant passenger aircraft does).

monadic3 · on Feb 10, 2021

While that's a convincing narrative, the idea of a true root cause is fallacious.

KMag · on Feb 10, 2021

Exactly. "Complex systems almost always fail in complex ways." If a system is complex and yet stable enough that it's running in production, it's frequently running in a partially degraded state and at least partially resistant to single component failures. Noticeable failures reaching the level of root-cause analysis almost certainly involve multiple component failures, each with proximate and root causes. Recursively decompose those component failures if the components themselves are complex, and you're left with many "root" causes.
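The recursive decomposition described above can be sketched in a few lines of Python. This is a toy model under stated assumptions: the `Failure` class, the `root_causes` function, and the example tree (loosely based on the Concorde details mentioned earlier in the thread) are hypothetical illustrations, not anything taken from the article.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Failure:
    """A failure with zero or more contributing failures (hypothetical model)."""
    description: str
    causes: List["Failure"] = field(default_factory=list)


def root_causes(failure: Failure) -> List[str]:
    """Recursively decompose a failure and collect the leaves of its cause tree."""
    if not failure.causes:
        # Nothing further to decompose: treat this as one "root" cause.
        return [failure.description]
    roots: List[str] = []
    for cause in failure.causes:
        roots.extend(root_causes(cause))
    return roots


# Illustrative tree only; the grouping of causes is an assumption.
crash = Failure("crash", [
    Failure("fuel tank ruptured", [
        Failure("debris on runway"),
        Failure("tank overfilled, no air gap"),
    ]),
    Failure("could not return to airport", [
        Failure("over maximum takeoff weight"),
        Failure("tailwind on takeoff"),
    ]),
])

print(root_causes(crash))
```

Even this small tree yields four leaves, so a single noticeable failure ends up with several "root" causes.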

k1t · on Feb 9, 2021

Is there really a difference though?

To me a "trigger" is the initial event that begins a sequence. Isn't that also a "root cause"?

Since everything is connected to everything else, it seems like the point that you decide is the "root" is fairly arbitrary.

It seems you could easily conclude that the root cause was that the runway wasn't cleaned/inspected often enough. Or that the departures were scheduled too close together, preventing such an inspection, etc..

If anything I would say the root cause was the piece of engine cowl falling from the preceding flight - since that seems to be the first thing that "went wrong" in the process.

dathinab · on Feb 9, 2021

Debris on the runway is something to be expected in rare cases; given how many flights there are a day, it's just a matter of time until it happens.

As such, the problem is an airplane designed in a way that makes it too prone to fatal accidents in certain "rare but guaranteed to happen at some point" situations.

But in the end, whether you say both are triggers which together led to the catastrophe, or that one is a trigger and the other is the root cause, is indeed irrelevant.

The problem is that if you do something I will call trigger analysis, but refer to it as root cause analysis and treat it as if it gives you the root cause, it can very easily lead to situations where you fix one of the problems but not all, and potentially not even the biggest problem.

I.e. you make it slightly less likely that there is debris on the runway, but you don't fix the problem of the airplane being too prone to certain kinds of catastrophic failure.

azinman2 · on Feb 9, 2021

If I said 'screw you', and that caused you to flip out and kill everyone around you, you couldn't say the root cause was me saying 'screw you'. The root cause might be childhood trauma, extreme emotional imbalance, irrational thoughts, etc. The statement 'screw you' was the trigger.

Similarly here, a trigger might be uploading a photo to FB, but the root cause of an issue might be a bug in encoding JPGs.

breischl · on Feb 9, 2021

One approach to this is the "Five Whys" approach, wherein you ask "why" five times. eg,

Q1: Why did the plane explode?

A1: The engine cowling fell into the engine

Q2: Why'd that happen?

A2: The tire exploded and damaged it.

etc etc.

Obviously the number 5 is arbitrary and not always applicable, it's just a heuristic to get to something "root-ish" without getting to ridiculously distant things like "the laws of physics prevent two objects from inhabiting the same position in space-time".

More generally, defining something as root vs. not is somewhat of a judgement call. Usually you try to find something that will prevent future problems of this sort and call it the root cause. Ideally something that your organization can mitigate with a reasonable time/cost.

Note that the actual mitigation is a separate question. If runway debris is the root cause, then one mitigation is reworking the fuel system. Another would be using tougher tires. Perhaps another would be adding a shield between the tires and the aircraft body. Another might be an automated runway monitoring system that detects debris. etc.
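As a rough sketch of the heuristic above: keep asking "why" until you either run out of answers or hit the arbitrary limit. The `five_whys` function and the `answers` mapping are hypothetical names for illustration; the first two answers paraphrase the Q/A pairs above, and the last one is an assumed continuation using the runway debris mentioned elsewhere in the thread.

```python
from typing import Dict, List


def five_whys(problem: str, answers: Dict[str, str], max_whys: int = 5) -> List[str]:
    """Follow the chain of "why" answers from a problem, at most max_whys steps."""
    chain = [problem]
    current = problem
    for _ in range(max_whys):
        cause = answers.get(current)
        if cause is None:
            # Nobody has an answer to this "why": stop here.
            break
        chain.append(cause)
        current = cause
    return chain


# Each key is an observation, each value is the answer to "why did that happen?"
answers = {
    "plane exploded": "engine cowling fell into the engine",
    "engine cowling fell into the engine": "tire exploded and damaged it",
    "tire exploded and damaged it": "debris on the runway",  # assumed next step
}

chain = five_whys("plane exploded", answers)
print(" <- ".join(chain))
print("root-ish cause:", chain[-1])
```

The `max_whys` cap plays the role of the arbitrary "5": it stops the chain before it wanders off to the laws of physics. Picking a mitigation for the last link remains a separate, human decision.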

Double_Cast · on Feb 10, 2021

A literal trigger is a lever that fires a gun. A figurative trigger is some event that sets in motion a cascade of events that occur in rapid succession. The root cause of firing a gun would probably be analogous to loading the chamber with a live round. Which may have occurred hours, days, months, etc before the trigger was pulled.

I think the distinction is that the "trigger" relates to a particular instance of failure, whereas the "root cause" relates to a class of failures.

kryogen1c · on Feb 9, 2021

> Is there really a difference though?

yes.

> To me a "trigger" is the initial event that begins a sequence. Isn't that also a "root cause"?

no. root causes are irreducible, hence the word "root".

if someone is endlessly trained for an event and then fails at game time, it's probably a root cause. people make mistakes that cannot be avoided (this is why defense in depth is a thing)

if the training program is a 5 second sentence before the event, the person's mistake is not the root cause, it's the training program.

spockz · on Feb 9, 2021

To me the placement of the tanks or lack of protection would be the root cause, because this failure could have been triggered by any other debris as well. So the root cause is the thing that, if you fix it, fixes the problem at a fundamental level.

perl4ever · on Feb 10, 2021

I'm surprised nobody brings up swiss cheese theory here.

qbasic_forever · on Feb 9, 2021

So any concrete stats on how this has helped shorten bug investigation time, improve quality of releases, etc? It looks like an interesting data-driven approach to bug finding but there's curiously no qualitative analysis of how it's actually working in practice. I'd be a little concerned that systems like this can fall into the background as a flurry of noise and process that doesn't actually improve the quality of the product.

haihaibye · on Feb 10, 2021

When naming your software, it's probably not a good idea to re-use the same name as one of the most popular games of all time.

schemescape · on Feb 9, 2021

This seems like a great system for isolating steps to reproduce a bug, but I’m not sure I would consider this “root cause analysis”.

Tarsul · on Feb 9, 2021

It appears to have nothing to do with the game.

KMag · on Feb 10, 2021

Dammit. You just made me lose The Game[0]

[0] https://en.wikipedia.org/wiki/The_Game_(mind_game)

nomy99 · on Feb 9, 2021

typo: Engineering not Enginering

pronoiac · on Feb 9, 2021

I emailed the mods.