Why I'll be letting Nagios live on

jccooper · on March 4, 2014

Quick summary, since the site is down:

In response (obviously) to http://www.slideshare.net/superdupersheep/stop-using-nagios-... (also available through HN).

Author uses Nagios at Etsy, with 10,000 checks (mostly in the 2 minute range), and it seems to work well for them with some minor tweaking. They have a plugin that provides a REST API. And he thinks the rest of the complaints are about backend stuff that he doesn't deal with much or at all (configs, wire formats, etc.)

He considers Nagios to be simpler than the proposed Sensu, and prefers "Unix style" applications rather than monolithic ones, where Nagios is certainly on the simpler side.

So he will continue using Nagios because it works well for his use, and good luck making something better.

toomuchtodo · on March 4, 2014

> So he will continue using Nagios because it works well for his use, and good luck making something better.

Ops guy here. If something better than Nagios/Zabbix came along, ops people would use it. Nagios gets used because 1) people know it and 2) it's pretty easy to set up.

Is it going to scale to Netflix or AWS scale organizations? Probably not. But at that scale, I presume you have buckets of money to throw at problems (i.e. build a custom monitoring platform).

nailer · on March 4, 2014

Back in 2009/10 I used an 'enterprise' competitor to both that was proprietary and 'definitely nothing to do with Zabbix' according to the vendor.

A simple 'strings' on the proprietary app's agents revealed a whole lot of 'zbx'.

Link: https://www.zabbix.com/forum/showthread.php?t=10155&page=5

nobodysfool · on March 4, 2014

That's like saying 'If something better than PHP came along, programmers would use it'. Something better did come along, but that doesn't change the fact that PHP is still popular.

akerl_ · on March 4, 2014

It's not like saying that at all. There aren't a plethora of full-featured monitoring/alerting tools out there.

on March 4, 2014

[deleted]

koenigdavidmj · on March 4, 2014

Ten seconds on Facebook shows that the Laurie Denness who works at Etsy is most decidedly not a woman.

rjzzleep · on March 4, 2014

I love the diagram making fun of overly complex systems

https://pbs.twimg.com/media/BNELF1GCUAExynU.png:large

for reference the architectural diagram of sensu

http://portertech.ca/images/2011-11-01/sensu-diagram.png

beaker52 · on March 4, 2014

Hopefully Nagios is showing him some red metrics right about now.

yeukhon · on March 4, 2014

The first bullet point: “Doesn’t scale at all.” I wonder how much traffic is being sent to his blog right now. In just 1 hour we DDoS-ed it with HNers visiting.

toomuchtodo · on March 4, 2014

I'm going to buy pg and company a beer and get code integrated into HackerNews that turns all submitted links into Coral cache links. It's 2014 and the Slashdot effect is alive and well :(

yeukhon · on March 4, 2014

That's actually a great idea! :) Hadn't actually thought of that. Two things come to mind: (1) rewrite links but provide a copy-paste version of the original link (think about XSS from clicking a link in a tweet; people want to look at the link before clicking it), and (2) HN's rewritten/shortened link must do dynamic cache URL retrieval, not static. I think Google's cache link is different, as the cache is updated.
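A rough sketch of what the rewrite could look like, in Python. Everything here is hypothetical: the `coralize` helper is made up for illustration, and it assumes Coral serves a cached copy when `.nyud.net` is appended to the hostname of a plain `http` URL. It returns the original link alongside the cached one so the original can still be shown for copy-paste, and it computes the cached link at call time rather than storing it.

```python
from urllib.parse import urlsplit, urlunsplit

# Assumption: Coral serves a cached copy of a page when this suffix
# is appended to the hostname of a plain http URL.
CORAL_SUFFIX = ".nyud.net"


def coralize(url):
    """Return (cached_url, original_url) for a submitted link.

    Links that this sketch does not know how to rewrite (non-http schemes,
    explicit ports, embedded credentials, already-rewritten hosts) are
    returned unchanged in both positions.
    """
    parts = urlsplit(url)
    netloc = parts.netloc
    rewritable = (
        parts.scheme == "http"
        and bool(netloc)
        and "@" not in netloc
        and ":" not in netloc
        and not netloc.lower().endswith(CORAL_SUFFIX)
    )
    if not rewritable:
        return url, url
    cached = urlunsplit(parts._replace(netloc=netloc + CORAL_SUFFIX))
    return cached, url


if __name__ == "__main__":
    cached, original = coralize("http://example.com/post?id=1")
    print("cached:  ", cached)
    print("original:", original)
```

Because the cached URL is derived on each call instead of being saved with the submission, switching to a different cache later would only mean changing this one function.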

toomuchtodo · on March 4, 2014

Thanks! I agree with both points; I'd even suggest adding a link to the Internet Archive's copy, but baby steps.

rrreese · on March 4, 2014

Google cache: http://webcache.googleusercontent.com/search?q=cache%3Ahttps...

lozzd · on March 4, 2014

Well, this is embarrassing.

My colo server died earlier today (completely unrelated to this) and didn't come back up because GRUB hadn't reinstalled properly on a replaced software RAID disk.

But yes, Nagios did tell me it was down :)

jcmcken · on March 5, 2014

> ...just the architectural diagram of how it works scares the shit out of me. When you need 7 arrow colours to describe where data is going in a monitoring system, I’m starting to fear it slightly.

This strikes me as a pretty lazy argument. Admittedly, that diagram is not the best. But that you can separate the components of Sensu isn't a bug, it's a feature. No one says you have to do it this way. In fact, you can exactly replicate the architecture of Nagios by having a single server. The point is you have choices, choices which are dependent not on the monitoring software (which is very lightweight by design in the case of Sensu), but on the other open source software it relies on (Redis, RMQ).

So I would argue that the "server instance" scope is way too broad a category to measure complexity. If you attempted to diagram the workflow Nagios uses (ignoring for a moment what server instance each component is on), you would come up with something equally bad (if not worse). That's if you even understood anything at all about how Nagios works to know what to diagram.

So let's replace one crude measure with another. The Sensu core repo is ~3MB total. Nagios core is about 10x that (30MB). NRPE is about 1MB all by itself. Mod_gearman (to pick out an add-on) comes in at a whopping 6MB. Suffice it to say, for something that's basically a glorified exit code validator, this seems like a lot of complexity. Sure, Nagios has a lot of features that Sensu doesn't have, and that accounts for some of this. But there's a lot to be said for modular systems vs. monolithic ones.
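To make the "exit code validator" remark concrete, here is a minimal Python sketch of the core idea, not how Nagios or Sensu actually implement it. It relies on the conventional plugin exit codes (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN); the `classify_exit_code` and `run_check` helpers are hypothetical names, and treating any other exit code, a timeout, or a missing binary as UNKNOWN is a choice made for this sketch.

```python
import subprocess
import sys

# Conventional plugin exit codes; anything else is treated as UNKNOWN here.
STATUS_BY_EXIT_CODE = {0: "OK", 1: "WARNING", 2: "CRITICAL", 3: "UNKNOWN"}


def classify_exit_code(code):
    """Map a check's exit code to a status name."""
    return STATUS_BY_EXIT_CODE.get(code, "UNKNOWN")


def run_check(argv, timeout=10):
    """Run one check command and return (status, first line of its output)."""
    try:
        result = subprocess.run(
            argv, capture_output=True, text=True, timeout=timeout
        )
    except (OSError, subprocess.TimeoutExpired) as exc:
        # The check could not be run or did not finish in time.
        return "UNKNOWN", str(exc)
    lines = result.stdout.strip().splitlines()
    return classify_exit_code(result.returncode), (lines[0] if lines else "")


if __name__ == "__main__":
    # A stand-in "plugin" that prints a message and exits with status 2.
    fake_plugin = [
        sys.executable,
        "-c",
        "import sys; print('DISK CRITICAL'); sys.exit(2)",
    ]
    print(run_check(fake_plugin))
```

Scheduling, notifications, dependencies, and the UI are what sit on top of this loop, which is where most of the size difference being compared would come from.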

lnanek2 · on March 4, 2014

Is this some sort of joke? Maybe it is intentionally down and alerting him? :)

nasalgoat · on March 5, 2014

His point about the NRPE config is spot-on - just put all your checks in one file. Done.

Not sure why people are making Nagios configuration more complicated than it needs to be.

Torn · on March 4, 2014

Site down - anyone got a mirror?

mzs · on March 4, 2014

http://webcache.googleusercontent.com/search?q=cache:mMFIAge...