Real World Examples of GPT-3 Plain Language Root Cause Summaries (zebrium.com)
88 points by Loke123 on March 25, 2021 | 25 comments



This is interesting, but I'm not sure I'm a fan. One of the core problems in building businesses out of AI is that people pick problems they want to solve and try to get AI to solve them rather than picking problems that AI is likely to be able to solve. AI is unreliable, particularly in the long-tail, and cannot easily be used in domains where the cost of failure is high. Using AI to accelerate humans via human-in-the-loop AI tools is a great example of where things like GPT-3 can have real value.

But good root cause analysis is a matter of establishing facts and building a chain of logic on top of those facts to get to the root cause. You cannot rely on models like GPT-3 to give you reliable baseline facts. Particularly when you are talking about a production issue that needs to be fixed ASAP. The key line that worries me from this blog post is "when results are suboptimal, they are mostly not misleading". 'Mostly not misleading' isn't going to cut it when I'm in the middle of an outage. I think that will prove to be a problem if this tool gets widespread usage.

That being said, I'm a huge fan of applying AI to human-in-the-loop problems and this was a cool idea for how modern language models can be applied.


I'm so with you on this. First, you don't need to turn root cause analysis into text like this via GPT-3; there are easier ways.

Secondly, I imagine most cases of "root cause analysis" require you to be very, very clear in understanding the... root cause... So using generalized language models will probably lead to unacceptable errors, which means there are probably better ways of addressing this problem (as per the discussion here on error rates and unacceptable errors in ML-products: https://phaseai.com/resources/how-to-build-ml-products)


Fair point, but this blog and your comment are about a summary sentence. If you read through to how the underlying log reports are constructed, those are very accurate (and also quite concise).


I think your perspective is actually way off the mark here. AI excels at long-tail problems where the cost of failure is high, precisely because human failure is such an expensive problem in those cases and the nature of long-tail problems makes it impossible to apply QA to every use case. In other words, you know you are forced to get it wrong a lot and pay the high failure cost, so using a system that can optimize that trade-off explicitly is often much better than pretending a human in the loop is somehow sparing you the failure costs when they aren't (and in fact they are simply less efficient than algorithmic solutions).

What constitutes a useful sequence of facts in root cause analysis is not just some platonic existing thing. It’s a complex problem involving mind-melting log sleuthing, correlating all kinds of disparate metrics, comparing against timestamps of merges and eventually synthesizing the results.

Even seasoned veterans who know systems inside and out struggle with the sheer volume of logs, metrics and facts to compile. And most of the time their approach is based purely on inductive experience with similar incidents combined with heuristics.

This is precisely the kind of problem that ML solutions excel at. It has many hallmarks of a good fit and almost none of the hallmarks of “solution in search of a problem” ML over-engineering.


A couple of things here.

First, you are conflating the underlying log relevance scoring ML system with the GPT-3 summarizing system. ML is a good fit for identifying relevant logs for the reasons you describe, although characterizing this software as root cause identification is not very accurate in my opinion, based on the examples you can find on their website. But the value of summarizing a log line into natural language is low, while the cost of misleadingly characterizing that log line is high. Whoever needs to debug this system and find the real root cause (e.g. why did the system go OOM?) probably needs certainty more than convenience, and in all likelihood they can summarize what the log line says more accurately than GPT-3 can (obviously we don't know, since there is no evidence, but I don't work with any engineers whose ability to summarize the contents of a log line would be described as "mostly not misleading").

Secondly, I can't agree with this sentence:

> AI excels at long-tail problems where the cost of failure is high, precisely because human failure is such an expensive problem in those cases

Maybe it depends on domain and tech, but in my experience humans don't fail on out-of-sample data nearly as often as AI does. When they do fail, it is often more predictable to other humans, and humans inherently have the ability to assign confidence levels to their conclusions, which you don't see in many AI models such as GPT-3. Humans are also more effective at applying rules (e.g. common sense) to improve predictions on out-of-sample inputs. I think of "AI is worse than humans at generalizing out-of-sample" as a widely held, well-evidenced belief, but I would be interested if you disagree.

For me, the quintessential example is something like traffic light identification, where models generally struggle to identify unseen variants correctly while humans rarely struggle with it. What examples are you thinking of where AI excels at long-tail problems?


Can it be both? AI excels at doing repetitive things better than humans, like maybe driving a car. Until it encounters a situation it hasn't seen before that cannot accurately be described with bits and pieces of what it knows. Then its result isn't a little off the mark, it's a lot. Think of a disaster like what happened with the Ever Given or Chernobyl. How many different ways could AI have made those situations worse, given that there is no good definition of an optimal solution?


Generalization to previously unseen examples is one of the core components of ML models.


The human-readable text is very nice. However, these are not Root Causes:

* The root cause of the issue was that the Jenkins master was not able to connect to the vCenter server. ==> Why was it not?

* The root cause was a drive failed error ==> Why did the drive fail?

* The root cause was that the Slack API was rate limited. ==> Why was it rate limited?

These examples from the article may be human-readable errors, but that doesn't make them root causes.

To have a root cause analysis, try asking Why five times. https://en.wikipedia.org/wiki/Five_whys
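To make that concrete (a made-up chain, not one taken from the article), starting from the Slack example:

* Why was the Slack API rate limited? Because the alerting service burst thousands of calls in a minute.

* Why did it burst? Because a retry loop fired on every failed call with no backoff.

* Why no backoff? Because the client ignores the Retry-After header on 429 responses.

* Why does it ignore it? Because the wrapper library predates Slack's rate limits and nobody revisited it.

* Why did nobody revisit it? Because no team owns that integration.

The GPT-3 summary stops at the first question.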


Good point. This is the unfiltered response from the GPT-3 prompt, and the phrase "root cause" is a bit of an overstatement by GPT-3. However, the log events collected in the actual reports are far more descriptive. You can find examples here: https://www.zebrium.com/blog/using-gpt-3-with-zebrium-for-pl... and here: https://www.zebrium.com/blog/is-autonomous-monitoring-the-an...


“Root cause” is relative. To the CEO the root cause is “some engineering thing broke.” To the data engineer the root cause may be that a human config error led to a rogue process that caused a VM disk to fill up. To a quantum super-intelligence the root cause may be that in Everett Branch 2765425 atom 67896533 collided with atom 78532578.

(Just kidding. The space of Everett branches is not countable.)


This kind of output is what the logs should have said in the first place...

I wish projects like the linux kernel would work on making log messages, at least those for common events, more readable to an engineer who isn't familiar with kernel internals.


When it comes to the systemd logs, this is kind of what the -x flag to journalctl does (or tries to do).

Having detailed human-level descriptions of what's going on and how to fix it is great. But you also don't want to drown out any important details under waves of verbose text.

The solution, then, is to show the extra detail only when it's requested with the -x flag.

This works pretty well, all things considered. The detailed messages are fine, though they could be better; that's probably always going to be true. It's a start, anyway.
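For anyone who hasn't tried it, an invocation like this (the unit name is just an example) prints the catalog explanations inline with the log lines:

    journalctl -xe -u nginx.service

-x pulls in the explanatory text from the message catalog, and -e jumps to the end of the journal.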


:) I'm sure you're not the only one to wish that.


This is the first use of GPT-3 to give me an "oh shit" reaction. Turning verbose, structured messages into something more human-readable is a huge problem space. Seeing one instance of it working makes me think there will be many more.


Good to hear we're not alone in thinking that this is a promising use case.


The set of people who need to know the root cause of a problem but aren't familiar with reading logs seems like it might be pretty small. I find that as a developer I only need to see something a few times before I have a sense of which log messages mean what. This seems like it might be valuable for a tech executive to get a quick feel for why categories of errors are occurring, but I would generally have someone compile and summarize that data for me.

Like most people, I love the idea of ingesting large amounts of data and making it readable. I guess what I would want, personally, is more like a GPT-3-powered Stack Overflow search where I can put in an arbitrary cryptic error and get a human-readable, root-cause-based explanation. This is a very interesting use case and I hope they continue to develop it.


A similar approach might be pretty useful for C++ template errors and other notoriously complex compiler errors that you tend to get with higher-order type systems.
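The classic toy case (just an illustration, nothing from the article) is sorting a std::map, which fails because map iterators aren't random-access and turns a one-line mistake into pages of template diagnostics:

    #include <algorithm>
    #include <map>
    #include <string>

    int main() {
        std::map<int, std::string> m{{1, "one"}, {2, "two"}};
        // std::sort requires random-access iterators; map iterators are only
        // bidirectional, so this single line doesn't compile and instead emits
        // a wall of nested template errors
        std::sort(m.begin(), m.end());
    }

A summarizer that could reduce that wall to "you can't std::sort a std::map; its iterators aren't random-access" would be genuinely useful.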


This could be an amazing IDE plugin.


Fantastic job - this is probably the only time I've seen a "code replacing humans" tool and thought "this could actually work".

I think a lot of commenters here are not quite getting the use case for this kind of summary. As I see it, it's not that an automatic summary is more accurate, or more complete, than an investigation by a human engineer - it's that the summary lets you resolve issues much faster and with less of a requirement to remember obscure implementation details.


appreciate your comment - spot on!


This is a follow-up to an earlier post describing the use of GPT-3 to summarize log events that describe software problems: https://news.ycombinator.com/item?id=25749820

This post shares examples of real summaries generated during beta tests, as well as examples of some sub-optimal outcomes.


Microsoft had something similar in the Windows 7 era: a crash dump analyzer that produced long reports. They fed those into a classifier to group similar dumps together, and then each group of dumps was given to one person to find the common bug.


Very cool. It's like an advanced problem search tool. I like it. This is phenomenal.


You're only supposed to have novel outages. If you can train a machine to summarize outages, you might be doing it wrong.


Just to clarify - our machine is unsupervised, so it learns what is normal for any application and identifies log sequences that are novel for that application. We then feed those sequences to GPT-3, which indeed tries to match them against existing data sets in the public domain. So while the problem may have been documented by someone else in the world, it is still novel for that particular application.



