Loupe: Etsy's New Monitoring Stack (codeascraft.com)
242 points by theyCallMeSwift on June 11, 2013 | 29 comments



Interesting stuff! I've actually been working on the same idea recently, starting with reading about anomaly detection. In particular, this survey: http://www-users.cs.umn.edu/~kumar/papers/anomaly-survey.php

I would like to know more about the performance of Skyline in practice:

- what are the accuracy and recall like?

- what is CPU consumption like?

Regarding the latter, I had a quick look at the implemented algorithms and they seemed very inefficient: basically recomputing over the entire series at every change. I think with a bit of work most of the algorithms could be reimplemented in an incremental way. I also wouldn't use Python for something that is going to be CPU bound. (I await the "We rewrote it in Go and it's 10x faster!" blog post ;-)
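
For example, a simple three-sigma check can be kept fully incremental with Welford's algorithm (a sketch of the idea, not Skyline's actual code):

  # A sketch of the incremental idea, not Skyline's code: Welford's online
  # mean/variance makes each new datapoint O(1) instead of recomputing
  # statistics over the whole series.
  class OnlineStats(object):
      def __init__(self):
          self.n, self.mean, self.m2 = 0, 0.0, 0.0

      def add(self, x):
          self.n += 1
          delta = x - self.mean
          self.mean += delta / self.n
          self.m2 += delta * (x - self.mean)

      def stddev(self):
          return (self.m2 / (self.n - 1)) ** 0.5 if self.n > 1 else 0.0

      def is_anomalous(self, x, sigmas=3):
          return self.n > 30 and abs(x - self.mean) > sigmas * self.stddev()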


Author here. Accuracy is okay - we err on the side of noise, but it does routinely pick up anomalies. It doesn't currently account for seasonal trends, though.

We aim for 100% CPU consumption. Analysis is a very CPU-intensive process, and there are two parts in particular that are expensive: decoding the Redis string from MessagePack into Python, and running the algorithms.
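
Concretely, the decode step looks roughly like this (a minimal sketch; the key name and storage layout are assumptions, the libraries are redis-py and msgpack-python):

  # Minimal sketch of the decode step: the key name and layout here are
  # assumptions, but the libraries are the usual ones.
  import msgpack
  import redis

  r = redis.StrictRedis()
  raw = r.get('metrics.example') or b''   # a string of packed (timestamp, value) pairs
  unpacker = msgpack.Unpacker()
  unpacker.feed(raw)
  timeseries = list(unpacker)             # this decode is one of the expensive parts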

As for the algorithm inefficiencies, pull requests encouraged :)

Rewriting it in Go is a plan for a rainy weekend :) The problem with Go is that its statistics support isn't nearly as good as Python's.


Thanks for replying! A little while ago I watched part of your Bacon Conf talk (http://devslovebacon.com/conferences/bacon-2013/talks/bring-...) and read the slides.

There were a few things I thought were a bit odd about the architecture, as I recall it.

IIRC you poll Graphite for metrics. Why not push them from StatsD directly into Skyline? This would probably be more efficient. If you used incremental / online / streaming algorithms you'd have a compact summary at each time step, so you could throw away the raw data. 250K metrics would fit in memory quite easily (we're just talking approximately a number and a string each, right?) and you'd have 4,000+ CPU cycles per metric per second to process them, which should be sufficient.
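
To put a number on "compact summary": an exponentially weighted mean and variance need only two floats of state per metric (names below are made up):

  # Made-up sketch of a per-metric streaming summary: two floats of state
  # per metric name, updated in place as datapoints arrive.
  summaries = {}   # metric name -> (ewma, ewm_variance)

  def update(name, value, alpha=0.05):
      ewma, ewmvar = summaries.get(name, (value, 0.0))
      diff = value - ewma
      ewma += alpha * diff
      ewmvar = (1 - alpha) * (ewmvar + alpha * diff * diff)
      summaries[name] = (ewma, ewmvar)
      return abs(diff) > 3 * (ewmvar ** 0.5) + 1e-9   # crude anomaly flag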

Python's lack of good threading could be a problem. I would use the JVM (Scala in my case). Apache Commons Math is pretty good (http://commons.apache.org/proper/commons-math/). Java's verbose interfaces are a bit annoying, but the JVM is damn efficient, and you can wrap the cruft in something more aesthetic. It's a solid choice no matter what the hipsters say. ;-)


Ah! We could, but StatsD provides support for more complex metrics like aggregated sums over time. Not something that lends itself easily to a discrete datapoint.

It is. But we use multiprocessing, which is basically the same API. Still, you can't beat the awesome Python stats libraries: NumPy, SciPy, statsmodels, Pandas...
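
In practice the API symmetry means the same worker code runs under either module; this is illustrative only, not Skyline's actual worker:

  # Illustrative only: multiprocessing.Process mirrors threading.Thread,
  # so the same worker function works under either module.
  from multiprocessing import Process, Queue   # vs. threading.Thread

  def analyze(batch, results):
      results.put(sum(batch) / float(len(batch)))   # stand-in for the real algorithms

  if __name__ == '__main__':
      results = Queue()
      workers = [Process(target=analyze, args=(chunk, results))
                 for chunk in ([1.0, 2.0, 3.0], [4.0, 5.0, 6.0])]
      for w in workers:
          w.start()
      print([results.get() for _ in workers])
      for w in workers:
          w.join()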


If you're analyzing streams of data, Haskell would be a good choice.*

*assumes that you've already put in the requisite couple of years in monastic contemplation to learn Haskell :-)


DTW is quadratic. While in grad school I worked for a bit with a team interested in doing massive speech recognition using DTW, so they did some work on speeding up the algorithm using a technique called Locality-Sensitive Hashing [1], [2], [3]. It might be worth a look to speed up your algorithms (there's a plain DTW sketch below the links, for reference).

[1] http://www.academia.edu/2600658/Indexing_Raw_Acoustic_Featur...
[2] http://old-site.clsp.jhu.edu/~ajansen/papers/IS2012a.pdf
[3] http://www.cs.jhu.edu/~vandurme/papers/JansenVanDurmeASRU11....
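
For reference, here is textbook DTW; the nested loops are the quadratic cost, and that is what the LSH indexing in the papers above works around (a sketch, not Oculus's implementation):

  # Textbook DTW (not Oculus's implementation): the nested loops make it
  # O(len(a) * len(b)), which is the quadratic cost being discussed.
  def dtw(a, b):
      INF = float('inf')
      n, m = len(a), len(b)
      cost = [[INF] * (m + 1) for _ in range(n + 1)]
      cost[0][0] = 0.0
      for i in range(1, n + 1):
          for j in range(1, m + 1):
              d = abs(a[i - 1] - b[j - 1])
              cost[i][j] = d + min(cost[i - 1][j],       # insertion
                                   cost[i][j - 1],       # deletion
                                   cost[i - 1][j - 1])   # match
      return cost[n][m]

  print(dtw([0, 1, 1, 2, 0], [0, 1, 2, 0]))   # small distance for similar shapes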


I was very intrigued until I poked around the github repos and noticed the server specs Etsy was using to analyze the 250k metrics.

Oculus recommended setup (found at https://github.com/etsy/oculus):

  * ElasticSearch
    * At least 8GB RAM
    * Quad Core  Xeon 5620 CPU or comparable
    * 1GB disk space
    * two ElasticSearch servers in separate clusters
  * a cluster of Worker boxes running Resque 
    * worker master runs redis
    * additional resque worker boxes (and potentially slaves)
    * At least 12GB RAM
    * Quad Core Xeon 5620 CPU or comparable
    * 1GB disk space
It'd be nice if there were a more established baseline set of server specs to get up and running. While many of us aspire to Etsy-level monitoring, we're just not there.


I'll definitely have a look at doing that - the initial specs were designed around the metric volumes we use the tools for, but I realise that might not be practical for smaller workloads :)


I'm not really convinced that this is very useful. I've been in the application monitoring space for a few years now and I'm not sure that watching graphs is something Ops people should be doing.

There should be rules which notify them if something is anomalous, by email, SMS, or by logging a problem in an incident management tool, e.g. "Java request foo.bar() on Managed Server 1 is throwing exceptions for 50% of invocations (20 requests, 10 exceptions) in the last 10 minutes. This affects the following services: Customer Login page on foo.bar." It could possibly even attach some of the exception messages to the email, sampled through instrumentation or correlated back to the log files automatically.
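
A rule like that is essentially a threshold over a window. A toy version, with the names and thresholds lifted from the example above and nothing else real about it:

  # Toy version of the rule quoted above; the names and thresholds come
  # from the example, not from any real monitoring tool.
  def check_exception_rate(requests, exceptions, window_minutes=10, threshold=0.5):
      if requests == 0:
          return None
      rate = exceptions / float(requests)
      if rate >= threshold:
          return ("foo.bar() is throwing exceptions for %.0f%% of invocations "
                  "(%d requests, %d exceptions) in the last %d minutes"
                  % (rate * 100, requests, exceptions, window_minutes))
      return None

  print(check_exception_rate(20, 10))   # fires; would be emailed / paged out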

This type of monitoring is actually useful because Ops understand what is broken and what it affects, and it gives them enough detail to either fix it or pass the problem to someone else; they're not wasting their time looking at graphs waiting for a problem to appear.


The thing is that at a given scale (and it comes early, actually), pushes do not scale.

I still use pushes for clear-cut things that require paging, but having graphs of a lot of things and just noticing changes or anomalies in the overall patterns will help spot a lot of issues, including things you haven't yet planned paging for :-)


Totally agree. We have a mix of algorithmic/automated monitors and visual monitors. I can tell you a lot more about how healthy our site is from looking at two screens in our NOC than I can from all the pages sent over the last <pick your time range>.

Computers are great at executing repetitive, specified tasks. Use them for that.

Humans are great at pattern recognition and flexibly adapting. Use them for that, IMO.


Very nifty. The automatic selection is a great innovation.

I built a somewhat similar system a while ago on top of StatsD/Graphite. Mine was not designed for production deployment, though, just as a test platform (I was basically using Graphite to store and query metric data; not optimal, but that problem was out of scope and it was easy to abuse like that). This tool allowed a user to manually select a set of metrics and create fault classifiers with those metrics.

These classifiers were able to detect not only the presence of faults but also to classify what type of fault they were, provided sufficient training data. (Of course, you could train new classifiers with data collected in production, so training new classifiers becomes an ongoing activity.) We were only testing geometric classification, but using any sort of classifier to identify complex fault types seems like an idea with promise.
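
In the same spirit, any off-the-shelf classifier can be dropped in once you have labelled windows of metric data. A toy sketch with invented features and labels:

  # Toy sketch of the classifier idea: features and labels are invented.
  # Each training row is a window of metric values flattened into a vector,
  # labelled with the fault observed at that time.
  from sklearn.neighbors import KNeighborsClassifier

  X_train = [[0.9, 0.8, 0.1, 0.1],    # e.g. [cpu, load, cache_hit_rate, throughput]
             [0.2, 0.1, 0.2, 0.2],
             [0.1, 0.1, 0.9, 0.9]]
  y_train = ['cpu_saturation', 'cache_stampede', 'healthy']

  clf = KNeighborsClassifier(n_neighbors=1).fit(X_train, y_train)
  print(clf.predict([[0.85, 0.9, 0.2, 0.15]]))   # -> ['cpu_saturation']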


Always fun to read these Etsy ops posts. I'm very curious to know what their practical architecture looks like that allows them to capture 250k unique metrics and also run skyline against them all. It seems like each new algorithm would add a ton of processing requirements when you're at that scale.

Also, it seems like this would be really useful with the addition of metric grouping and group-specific algorithms; right now it looks like their 250k metrics all pop up in the same anomalous bucket, with all metrics getting the same algorithms applied to them.


Abe gave a talk at OpenVisConf about Skyline and how to make use of 250k metrics: http://www.youtube.com/watch?v=Rij604NBXqk


Anyone recommend an easy way to get started with StatsD?

I've tried to configure/install/setup StatsD etc in the past but hit so many problems with dependencies, undocumented software needing to be installed, etc.

Any tutorial or something to get stats being tracked and graphed beautifully would be awesome.


Batsd [1] is a stripped-down version of (StatsD + Graphite) that works well, in my opinion. You won't have the full Graphite functionality etc., but it's easier to get started.

[1] https://github.com/noahhl/batsd


This blog post helped me when I took a look last year: http://geek.michaelgrace.org/2011/09/installing-statsd-on-ub...


It's just too bad there are a lot of moving parts in this setup.

Graphite consists of these three parts:

- carbon: a daemon that listens for time-series data.
- whisper: a simple database library for storing time-series data.
- webapp: a (Django) webapp that renders graphs on demand.

And StatsD is its own daemon.

That means three daemons need to run just to do stat aggregation.
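
For what it's worth, the wire protocols those daemons speak are tiny, which at least makes a setup easy to poke at by hand while debugging. A sketch assuming the default ports:

  # Sketch assuming default ports: StatsD takes "name:value|type" over UDP
  # 8125; carbon takes "name value timestamp" over TCP 2003.
  import socket, time

  udp = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
  udp.sendto(b'app.logins:1|c', ('localhost', 8125))        # a StatsD counter

  carbon = socket.create_connection(('localhost', 2003))
  carbon.sendall(('app.logins.raw 1 %d\n' % int(time.time())).encode())
  carbon.close()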


Just use the existing Chef recipes out there for setting it up. Why would you try to get all the moving parts integrated properly when the work has already been done? Stand on the shoulders of giants.


> when the work has already been done?

I am very wary of introducing new software into our stack if I don't understand it. Badly configured software could cause problems down the road.

Last time I tried out a recipe that installed Redis, it didn't version-lock redis-server, which meant the daemon couldn't start because some config options had been deprecated.

The recent DNS DDoS attacks were possible because people had set up badly configured DNS servers.


They might want to reconsider the name:

http://www.gibraltarsoftware.com/

Their monitoring solution is also called Loupe.


Skyline and Oculus both look interesting, and this is definitely a solid direction to be heading in.

However, I wonder if some form of topology knowledge, operations dependency tree or similar could further inform this type of root cause analytics.

Without a declarative-style "here is how things should be" model of adequate accuracy, it seems like the analytics will be stuck at the "these things are strange and happened at once, what does the human think?" level of sophistication.


I'm confused on one point: does the anomaly correlation find other metrics that look "similar", or other metrics that also have anomalies in the same time span? The latter seems like it would be very useful.

You mention elsewhere that statsd lets you do complicated aggregations over time. If you have a moving average of errors over 10 minutes or something, that's potentially not going to show up when you do anomaly correlation, since a spike is smeared across 20 minutes. Do you account for that? It would require knowing which metrics are aggregated across time and by how much, etc, I guess.


Oculus author here - Oculus detects other metrics that look similar, i.e. that have a similar anomaly or shape in the same time span. It doesn't pick up metrics that have other, "dissimilar" anomalies - that part is left to Skyline.

Oculus treats all metrics that it gets from Skyline equally at the moment, i.e. it doesn't know whether what it's looking at is an aggregation or a single set of data points. It just takes the data as it's presented. It would be totally possible, however, to add 10 and 20 minute averages (for example) for the same metric into Skyline so that Oculus would treat them separately.
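
A sketch of what feeding in pre-aggregated variants could look like, with invented metric names (this is the "totally possible" idea above, not something Oculus does today):

  # Invented metric names; just illustrating the "same metric under several
  # aggregations" idea from the paragraph above.
  import pandas as pd

  timestamps = [0, 60, 120, 180, 240, 300]              # one point per minute
  values     = [1.0, 1.2, 0.9, 4.5, 1.1, 1.0]
  ts = pd.Series(values, index=pd.to_datetime(timestamps, unit='s'))

  derived = {
      'errors.rate':          ts,
      'errors.rate.10min_ma': ts.rolling('10min').mean(),
      'errors.rate.20min_ma': ts.rolling('20min').mean(),
  }
  # each series would be written into Skyline under its own metric name,
  # so Oculus would see them as separate metrics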


This is really interesting. I can't even imagine measuring 250k different metrics without a tool like this. It's just so much data to assess.

Granted it would be extremely useful for post-mortems, but looking at it real time is a bit like the library of Babel [1].

[1] http://en.wikipedia.org/wiki/The_Library_of_Babel


> That’s far too many graphs for a team of 150 engineers to watch all day long!

Does Etsy have 150 engineers? Is that even possible?


I'll take the downvote as a "yes" :)

It's true the "Is that even possible?" was out of line and I should have tempered it -- but I am truly surprised.


I love Etsy's open source contributions, but am I the only one here who thinks they are due for a blog redesign?


It's in the works :)




