Simple Anomaly Detection Using Plain SQL (hakibenita.com)
512 points by harporoeder on Jan 11, 2021 | 47 comments



I've implemented something very similar at work; this was a nice write-up. The biggest difference is that we use Welford's algorithm[0] to calculate a running variance, so we can flag anomalies in real time without needing to store the logs. It works quite well.

[0]: https://en.m.wikipedia.org/wiki/Algorithms_for_calculating_v...
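
For anyone curious, the update itself is tiny; you only ever keep n, the running mean, and an accumulator M2 (this is the standard form from the linked page, nothing specific to our setup):

    start with n = 0, mean = 0, M2 = 0
    for each new value x:
        n    = n + 1
        d    = x - mean
        mean = mean + d / n
        M2   = M2 + d * (x - mean)
    sample variance = M2 / (n - 1)    (for n >= 2)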


Same here; the difference in my implementation was an added 8-week moving average. This worked for most of the year, and I think even for holidays: what was considered normal for the week of Thanksgiving would not be considered normal for a summer week. This was relatively easy using SQL CTEs and window functions.
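
Roughly this shape, if anyone wants to try it (not my actual query; I'm assuming a summary table like the article's, with a period timestamp and an entries count):

    WITH weekly AS (
        SELECT date_trunc('week', period) AS week,
               sum(entries)               AS entries
        FROM server_log_summary
        GROUP BY 1
    )
    SELECT week,
           entries,
           avg(entries)        OVER w AS avg_8w,
           stddev_pop(entries) OVER w AS stddev_8w
    FROM weekly
    WINDOW w AS (ORDER BY week ROWS BETWEEN 7 PRECEDING AND CURRENT ROW)
    ORDER BY week;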

I also separately used a package in R called bsts (Bayesian Structural Time Series) for a different way of projecting seasonality on a trend to find an acceptable normal range. If the actual fell out of range then it was an anomaly. Great write up on the technique by Kim Larsen here: https://multithreaded.stitchfix.com/blog/2016/04/21/forget-a...


This is a nicely done article. One of the issues with using a z-score for count data is that it assumes the data are normally distributed, which implies values < 0 are possible. However, since counts are bounded below at zero, their distribution is asymmetric, unlike the normal. In practice this means the method would produce too many false positives for most applications: whatever threshold is chosen is likely to leave too narrow an envelope between false negatives and false positives to be reliable for count data. For normally distributed data, however, this would work perfectly.


> One of the issues with using a z-score for count data is it assumes the data are normally distributed

A z-score is just the number of standard deviations away from the mean a data point lies. It doesn't require any assumption about the distribution of the data.

If you wanted to make some inference, like maybe about the likelihood of observing some z-score given a hypothesis, then you would need to assume a distribution.

But, nothing the author does assumes a distribution, normal or otherwise. It doesn't mention probability at all as far as I can see.
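
Concretely, the whole thing is just arithmetic (a sketch in PostgreSQL, borrowing the article's summary table; column names assumed):

    SELECT period,
           entries,
           (entries - avg(entries) OVER ())
               / nullif(stddev(entries) OVER (), 0) AS zscore
    FROM server_log_summary;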


When you decide that a value N standard deviations away from the mean is an anomaly that must be investigated, you implicitly make an assumption about the underlying distribution of the data.


Exactly, it's a smuggled assumption. A mean-plus-k-standard-deviations envelope is really calibrated to the normal distribution. A negative binomial is a much better model for count data: https://statwonk.com/poisson-gamma-negative-binomial.html


The simple approach in the article is no different if you imagine it as a negative binomial. The article does not assume any distribution; it's entirely nonparametric.


That's not the case. Looking for |z|>t implicitly assumes the distribution is symmetric and unimodal for a start. As in, the method would break if those assumptions were violated.

Non-parametric means more than just avoiding talk of distributions.


Good point about symmetry & unimodality. You could easily construct a bimodal example where the method fails entirely.

When I said "doesn't assume any distribution" in my previous post, I meant it doesn't depend on any particular distribution function.

But in the post above that I was sloppy and said "no assumptions about the distribution at all". Good catch.


That's looking at it backwards: this is a non-parametric approach. The author didn't assume a priori that a point with a z-score greater than N was anomalous, for some fixed N.

Instead, they determined N by looking at the past data.

(Contrast with the parametric approach of assuming normality and concluding that any point lying >3 sd from the mean is an anomaly due to properties of the Gaussian distribution.)


I don't think this is correct. You can calibrate a threshold for a metric empirically without interpreting the metric probabilistically, i.e. the fact that z > N is empirically a good threshold for anomaly detection calls for no further interpretation.


Sure, but for most processes that’ll yield disappointing performance.

Any tool can be used without regard for the consequences, but knowing a tool's statistical properties yields know-how about the consequences of its application. It's often a matter of the cost of errors and the stakes of the problem at hand. Cheap solutions are often the best for cheap problems.


Good point... maybe first transforming the data to be approximately normal could be a solution? (I'm not a statistician)
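
Something like a log transform is the cheap version of that idea for right-skewed counts. A sketch (column names assumed; whether it's appropriate depends on your data):

    WITH t AS (
        SELECT period, ln(entries + 1) AS log_entries
        FROM server_log_summary
    )
    SELECT period,
           (log_entries - avg(log_entries) OVER ())
               / nullif(stddev(log_entries) OVER (), 0) AS zscore
    FROM t;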


My current take on this is that one should use the P99 (or P99.9, etc.) percentile as the cutoff, and treat anything above that as an anomaly.

You then are not assuming anything about the distribution.

I'm curious if others think this is a bad/good approach.
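
In PostgreSQL that's easy to try with percentile_cont; a sketch against the article's summary table (column names assumed):

    WITH threshold AS (
        SELECT percentile_cont(0.99) WITHIN GROUP (ORDER BY entries) AS p99
        FROM server_log_summary
    )
    SELECT s.*
    FROM server_log_summary s, threshold t
    WHERE s.entries > t.p99;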


I like how this article builds everything up. Well written.


While this is all good, very few places AFAIK actually process their HTTP server logs into a form that can be queried by SQL. Google does it, certainly, but many smaller shops don't. Has this been your experience as well?


> very few places AFAIK actually process their HTTP server logs into a form that can be queried by SQL.

There are at least two tools that support directly querying log files with SQL:

* logparser -- https://en.wikipedia.org/wiki/Logparser

* lnav -- https://lnav.org

lnav uses SQLite and provides quite a few extensions, so quite a lot is possible. The logs are exposed via virtual tables, so there is no separate import step.
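
For example, inside lnav you can run queries like this against an Apache/nginx-style access log (from memory, so check the docs for the exact table/column names):

    ;SELECT sc_status, count(*) AS hits
       FROM access_log
      GROUP BY sc_status
      ORDER BY hits DESC;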


Opposite experience here. Most places of size I know of shove weblogs into SQL, S3, or some Hadoop infra.


Yeah, SQL tends to be the lingua franca for "big data" tools, so while logs usually won't be loaded into a regular MySQL/PostgreSQL instance, it's pretty common for them to be made available in something like Hive that allows them to be analyzed using SQL.


I use simple bash scripts that tail the logs and insert them into an SQLite database that I query through Metabase. It's been working great for my projects and no issues with scale or downtime. There's a writeup on my blog if you want :)


I am doing exactly this in my current project: I put the server logs of my API services into a column-oriented database with extensive SQL support, and it feels great. SQL lets me build a solid analytics dashboard in a dialect I already know, and in terms of performance it runs circles around the ELK stack (a short animated GIF preview of the UI is available at https://apiroad.net/ )


ApiRoad is an interesting project, are you running this all by yourself?

In any case, do you have a write-up somewhere of how you use a columnar DBMS to slurp your server logs?


Thank you! Yes, this is a big solo project now. I didn't have time to do a write-up yet. I am a big fan of ClickHouse, so I use it for log storage. To collect server logs, it is not exactly a simple file-parsing setup: I have a gateway daemon that sits in front of the API server. The daemon receives all the requests from subscribers, authorizes that the specific connection has access to the specific API, rate limits, and proxies the connections to the upstream API, dumping HTTP logs to ClickHouse periodically. Then a Vue.js & Laravel powered dashboard queries ClickHouse to generate various stats, which are later used for analytics and usage-based billing.
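
The dashboard queries are plain SQL along these lines (illustrative only, the real table and column names are different):

    SELECT api_id,
           toStartOfHour(ts)      AS hour,
           count()                AS requests,
           countIf(status >= 500) AS errors
    FROM http_log
    GROUP BY api_id, hour
    ORDER BY hour;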


I haven't seen it that often either.

I like to store Apache & nginx logs as JSON so they can be parsed with tools like jq, and adding/removing fields doesn't break parsing.

From JSON it is also easy to pull them into PostgreSQL (at least) in jsonb format, or parse out key elements as their own regular table fields.
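
A minimal sketch of the PostgreSQL side (field names are just examples):

    CREATE TABLE access_log_raw (entry jsonb);

    SELECT (entry->>'time')::timestamptz AS ts,
           entry->>'request'             AS request,
           (entry->>'status')::int       AS status
    FROM access_log_raw
    WHERE (entry->>'status')::int >= 500;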


At least in the AWS ecosystem, Athena makes this very simple to implement. Stream load balancer logs to S3 and query directly.
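
A sketch of what that looks like, assuming a table along the lines of the alb_logs example in the AWS docs (column names may differ):

    SELECT date_trunc('hour', from_iso8601_timestamp("time")) AS hour,
           count(*) AS errors
    FROM alb_logs
    WHERE elb_status_code >= 500
    GROUP BY 1
    ORDER BY 1;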


I don't see any reason why even a solo developer couldn't/wouldn't do this? It's just one method to use.


That's what you took from that? They gave an example of using SQL to do anomaly detection that happened to use httpd logs as the data source.

The strategies within could be applied to virtually any time-series data.


I have to say, the whole blog is pretty good. If you like this article, definitely stay and look around. It's a good example of proper old-school, deep but well explained engineering.


One can also use the SQL SUBSTRING function to draw horizontal ASCII-art bar charts in order to eyeball anomalies. Use a WITH sub-query to find the minimum and maximum values to normalize the bar widths.

For larger data-sets you can group them, such as finding the min, max, and average for each given month, and then graph them using different bar symbols: M=min, X=max, A=average. Once you spot an odd month, you use the original detail version to study that month further. (Make sure you display the bars using a fixed-width font, such as Courier.)

    2020-01 MMMMMMMM
    2020-01 AAAAAAAAAAAAAAA
    2020-01 XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
    2020-02 MM
    2020-02 AAAAAAAAAAA
    2020-02 XXXXXXXXXXXXXXXXXXXXXXXXXXX
    2020-03 MMMMM
    2020-03 AAAAAAAAAAAAAAAAAAA
    2020-03 XXXXXXXXXXXXXXXXXXXXXXXXXXX
    Etc...
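
A rough PostgreSQL sketch of the basic single-bar version (one bar per month, scaled to 40 characters; table/column names assumed):

    WITH monthly AS (
        SELECT to_char(period, 'YYYY-MM') AS month,
               sum(entries)               AS entries
        FROM server_log_summary
        GROUP BY 1
    ),
    bounds AS (
        SELECT min(entries) AS lo, max(entries) AS hi FROM monthly
    )
    SELECT month,
           substring(repeat('X', 40), 1,
                     (40.0 * (entries - lo) / nullif(hi - lo, 0))::int) AS bar
    FROM monthly, bounds
    ORDER BY month;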


A tool like Metabase would make short work of visualizations of that nature.


But with this you don't have to install anything new: it's just leveraging your existing RDBMS. You just need a few sample SQL snippets to borrow from.


Amazing work here. The slow build, the clear code with highlighting, and the suggested improvements at the end - all made this easy to read and follow along!



My problem with this is that if you do the same thing but remove the value 12 from the data, you get 4 out of 8 values that are outside 1 stdev...


You don't need to and probably shouldn't remove the outliers from your dataset. You can easily query it with:

  SELECT * FROM series WHERE NOT is_anomaly;

If you want, you can wrap a (materialized) view around this.


Thanks, but my point was that if there are no outliers, there will still be values that are located more than one stdev from the average.


A nice trick to detect anomalies in an array of arbitrary data is to make N subarrays (each with a different chunk of size(array)/N elements removed) and compress each of them.

The subarray that compresses the most has the anomaly removed.

It's not very cheap, but it's much more general than a median or average or string comparisons. It relies on the fact that the compressed size is an estimate of Kolmogorov complexity.


I love how he(?) wears the "I'm not a data scientist" statement proudly like a badge of honour. Hah!

Nice article. My only comment would be that performing anomaly detection on the same data that you use to "train" your detector in the first place is a classic example of a "data leak" in data science, and is likely to lead to errors in itself.


What would you consider the "training" step here? The part where the author performs backtesting to settle on an adequate value for the threshold? In that case I don't really see a problem, since the author seems to be looking at historical data to come up with a value for a parameter that will be used to determine the model that acts on future data.

I agree with your comment in general that using test data in your training set is a clear example of an error that will lead to overfitting and a model that generalizes poorly, but am having a hard time seeing where the author commits this mistake.


The way he describes the problem in the introduction alludes to this. I.e. "here's some data, 12 is a clear outlier, let's see if we can confirm this using the mean and standard deviation of the sample to derive lower and upper bounds".

Then he proceeds to include 12 in the calculation deriving these bounds. This is not really the way to do it. In fact, if he had excluded the anomalous measurement from the training data, the '5' values would have been excluded as outliers as well, given his criteria for defining the bounds.

I agree that this is a trivial point made on a trivial example, and that it is more a matter of 'sensible definitions' of what counts as anomalous or as the training set in the first place. But it's still worth thinking about explicitly, so I thought I'd mention it.


Why do you need a train/test split for a z-score? It's just telling you if data is extreme beyond some threshold defined by the distribution from which it originated.


Replace the value 12 in the introduction by 1,000,000 and see if you still agree with your premise :)


> My only comment would be that performing anomaly detection on the same data that you use to "train" your detector in the first place is a classic example of a "data leak" in data science, and is likely to lead to errors in itself.

But that's the whole point of the very well established concept of standard deviation: to look at a dataset in isolation and analyse it.


In machine learning there is the concept of a generalization bound, where you can determine whether your model has successfully generalized without reference to a validation dataset. In that case, your training and real-world datasets can be the same.


so, how do you gather the correct data to "train" the detector?


It's not really a question of the "correct" data. To evaluate whether an observation is anomalous, you typically define a threshold for what's considered not-normal, e.g. an observation 2 or 3 standard deviations above the mean.

This may leave you with some false positives (like any system of this nature would). Of course, you could go the route of actually defining an anomaly yourself and building a more sophisticated model (i.e. one not using z-scores as the measure of anomaly), but that's obviously a different scope.


Nice writeup. In case anyone wants to play around locally, the data used is present in the interactive editor linked in the article. You can just create the server_log_summary table as per the create table statement in the article and insert the data into a local PostgreSQL instance.
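
For reference, the table has roughly this shape (from memory; check the article for the exact CREATE TABLE):

    CREATE TABLE server_log_summary (
        period      timestamptz NOT NULL,
        status_code int         NOT NULL,
        entries     int         NOT NULL
    );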



