At my company we've been using Graphite and StatsD for nearly two years now, and we rely on them heavily for tracking performance and troubleshooting issues. We rely on Icinga, Pingdom, NewRelic and other tools to alert us to problems.
Often, when things have gone really wrong (DoS, internal network issues, app errors, disk full) the affected machine(s) stop reporting to Graphite (or under-report data). We get alerted by monitoring the services, not the stats.
Being alerted about low or unusual values might be helpful in some cases, but in my experience it would be too noisy. Usually when something bad happens, we investigate Graphite and our analytics tools anyway to understand the impact on traffic and KPIs.
I could see Rearview being useful for some cases, but not as a replacement for real monitoring and alerting tools.
We use NewRelic and Pingdom as well. Where Rearview really shines is creating monitors like this: 1) control charts to alert when a process deviates by more than 3 standard deviations above or below the mean based on historical data (e.g. purchases/logins are lower than expected, process failures are higher than expected, etc.), 2) deployment-triggered monitors that automatically analyze data before and after a deploy for shifts in mean or increases in variance (e.g. do we see more login failures after this deploy, do we see more 4xx/5xx responses, did page load time increase, etc.), 3) response time monitors... while this seems straightforward enough, Rearview can not only tell you when a service or page response time has exceeded some statistical limit, it can also present you with more information regarding causes (e.g. this process is slow because of an issue with the database, redis, a dependent process/service, etc.), 4) it allows you to use SPAN as a means of monitoring load time or response time (SPAN is the 95th percentile minus the 5th percentile, and it gives a much more accurate representation of what users experience than the mean or median), 5) process efficiencies can be checked by making sure they complete on time and execute the expected number of commands (e.g. sent email, updated databases, etc.), and many more. Basically you are only limited by your imagination and coding skills. Of course the other benefit is in performing similar monitoring on business metrics and not just application performance (e.g. is the funnel performing as expected/needed, are our customer tools being used on a regular basis, are our marketing campaigns paying off, etc.)
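To make items 1) and 4) concrete, here is a rough sketch of the arithmetic involved. It is written in Python rather than as a Rearview monitor, and it is not Rearview's API: the function names, the sample numbers, and the choice of a nearest-rank percentile are all assumptions made for illustration.

```python
import math
import statistics


def control_limits(history, sigmas=3.0):
    """Return (lower, upper) limits at `sigmas` sample standard deviations
    around the mean of the historical values."""
    mean = statistics.fmean(history)
    stdev = statistics.stdev(history)  # sample standard deviation
    return mean - sigmas * stdev, mean + sigmas * stdev


def out_of_control(history, current, sigmas=3.0):
    """True when `current` falls outside the control limits."""
    lower, upper = control_limits(history, sigmas)
    return current < lower or current > upper


def percentile(values, pct):
    """Nearest-rank percentile (an assumed definition; others exist)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def span(values):
    """SPAN as described above: 95th percentile minus 5th percentile."""
    return percentile(values, 95) - percentile(values, 5)


# Hypothetical logins-per-minute history and two new observations.
logins = [100, 102, 98, 101, 99, 100, 103, 97]
print(control_limits(logins))
print(out_of_control(logins, 80))   # far below the historical range
print(out_of_control(logins, 101))  # within the historical range

# Hypothetical response times in milliseconds.
print(span(list(range(1, 101))))
```

In a real monitor the history would come from Graphite rather than a hard-coded list, and the window used for the mean and standard deviation is a judgment call per metric.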
In my currently non-existent freetime, I'm a Graphite co-maintainer (check github). If you have any improvements or suggestions, please feel free to send us pull requests. The current pull requests are a bit of a mess, but I blame myself and will be getting around to merging a ton of them "real soon now TM".
Server side graphs didn't work out for all our monitoring use, so we don't use graphite. You should make a version that works with istatd :-)
https://github.com/imvu-open/istatd
This looks really polished and definitely a great idea. I can see why you chose Ruby for the scripting of the monitors, being able to evaluate that code in a predefined binding can be quite powerful, especially with the aid of helpers being pre-defined as well.
Why not a full ruby stack, or was the "live" scripting done after the initial inception?
We have always used Ruby for the scripting (we're predominantly a Ruby shop, so this was key for future adoption). The very first MVP for this tool was individual Ruby scripts running against Graphite and being scheduled via cron. The first real backend scheduler was built in Scala, but for various reasons we've converted to Rails/Puma/Celluloid running in a VM using JRuby. The monitors themselves run in an MRI sandbox for security purposes.
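For anyone curious what the "script against Graphite on a cron schedule" approach looks like in outline, here is a minimal sketch. It is in Python rather than Ruby and is not taken from our code: the host name, metric name, and helper names are hypothetical. The only thing it relies on is Graphite's render endpoint, which returns a list of series with `[value, timestamp]` datapoints when asked for `format=json`.

```python
import json
from urllib.parse import urlencode


def render_url(base_url, target, window="-1h"):
    """Build a Graphite render API URL that returns JSON for one target."""
    query = urlencode({"target": target, "from": window, "format": "json"})
    return f"{base_url.rstrip('/')}/render?{query}"


def datapoint_values(payload):
    """Extract the non-null values from a Graphite JSON render response.

    Graphite reports gaps as null, so those datapoints are skipped.
    """
    series = json.loads(payload)
    return [
        value
        for one_series in series
        for value, _timestamp in one_series["datapoints"]
        if value is not None
    ]


# Hypothetical host and metric; a cron-driven script would fetch this URL
# (e.g. with urllib.request.urlopen) and then apply its check to the values.
print(render_url("http://graphite.example.com/", "stats.logins.count"))

# A hand-written sample response so the sketch runs without a server.
sample = (
    '[{"target": "stats.logins.count", "datapoints": '
    "[[12.0, 1383000000], [null, 1383000060], [15.0, 1383000120]]}]"
)
print(datapoint_values(sample))
```

A cron entry would then just run the script every few minutes; everything beyond that (scheduling, history, the UI) is what the later backends added.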
I'm not sure I'm ready to abandon a custom monitoring environment consisting of a shell environment, screen, ssh certs, lugubrious quantities of /proc/, and a fair bit of gnuplot. Seems to me that's all you need? Why commit to a Ruby install for an operator console?
I'm not sure lugubrious means what you intended it to mean. :) At any rate, see my reply here https://news.ycombinator.com/item?id=6646402 for a sampling of things Rearview brings to the table. The tl;dr is that it's not a NOC tool, it's more for process monitoring whether that be application processes, engineering processes, or business processes. It also does provide a central location for anyone to see the state and history of an application or business unit.
>The tl;dr is that it's not a NOC tool, it's more for process monitoring whether that be application processes, engineering processes, or business processes. It also does provide a central location for anyone to see the state and history of an application or business unit.
Actually, Rearview started out similar to that. We wanted something more accessible to our large engineering team. It has all the advantages of a web user interface coupled with the powerful capabilities you get with a scripting language.
You're welcome! We did a second feature release of our Ruby version today that has even more UI goodness. Basically we've added the ability to group categories of monitors under one dashboard. You can then switch between categories using carousel controls or directly from a drop-down. We're hoping to open source this version soon, and we're crossing our fingers that the Ruby version will see more collaboration from outside developers.