Ask HN: What tools do you use to monitor your LAMP server(s)?
55 points by olalonde on Oct 23, 2010 | 38 comments
What tools do you use to monitor your web server(s)? More specifically, server uptime, resource usage (CPU, RAM, bandwidth, etc.), Apache, MySQL and PHP/Ruby/Python.



We have a cluster of about 60 servers, and use Monit and Ganglia to manage it. Monit will alert you when certain problems occur -- for example, if the server load is too high, if a service stopped running, or if it can't reach the database. Ganglia records data and graphs that data over time. We use it to see how busy our servers are getting, or how our peak resource usage compares to off-peak. Monit is important for noticing a problem; Ganglia's useful for diagnosing that problem.
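The Monit/Ganglia split can be illustrated with a minimal Python sketch (not either tool's actual code). `check_load` plays the Monit role of alerting when a threshold is crossed; the hypothetical `LoadHistory` class plays the Ganglia role of keeping timestamped samples so peak and off-peak windows can be compared later. The threshold and time windows are illustrative assumptions.

```python
import os
import time
from typing import List, Optional, Tuple


def check_load(threshold: float, load: Optional[float] = None) -> Optional[str]:
    """Monit-style check: return an alert message if the 1-minute load is too high.

    `load` can be passed in for testing; otherwise os.getloadavg() is used (Unix only).
    """
    if load is None:
        load = os.getloadavg()[0]
    if load > threshold:
        return f"ALERT: 1-min load {load:.2f} exceeds {threshold:.2f}"
    return None


class LoadHistory:
    """Ganglia-style recording: keep timestamped samples so they can be graphed later."""

    def __init__(self) -> None:
        self.samples: List[Tuple[float, float]] = []

    def record(self, load: float, ts: Optional[float] = None) -> None:
        self.samples.append((time.time() if ts is None else ts, load))

    def peak(self, start: float, end: float) -> float:
        """Highest load recorded in [start, end); 0.0 if there are no samples."""
        window = [load for ts, load in self.samples if start <= ts < end]
        return max(window, default=0.0)
```

Comparing `peak()` over a busy window against a quiet one is the kind of question the graphs answer; the alerting side only needs the current value.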

That said, I'm a huge fan of Cloudkick. Their software is ridiculously easy to install and use, and the web application makes it VERY easy to understand what's going on with your services. It's perfect if you're just monitoring one server -- it's free -- but it's worth the money even if you've got a whole cluster. If I could do it all over again, I'd definitely go with Cloudkick.


You would have to pay $11,388 per year to use Cloudkick with 60 servers and still lose all data >1 year.

I don't get the logic of paying that much for Cloudkick at all. Monitoring tends to be a couple of days' work (at most) to set up perfectly, and then very little ongoing effort. Hardly worth $949 every month IMHO.
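The annual and monthly figures in this exchange agree with each other. A quick check in Python, using only the numbers quoted in the thread ($949 per month for 60 servers):

```python
# Figures quoted in the thread: $949 per month for 60 servers.
MONTHLY_COST_USD = 949
SERVERS = 60

annual_cost = MONTHLY_COST_USD * 12              # 949 * 12
per_server_month = MONTHLY_COST_USD / SERVERS    # 949 / 60

print(f"annual: ${annual_cost:,}")                        # annual: $11,388
print(f"per server per month: ${per_server_month:.2f}")   # per server per month: $15.82
```

So the disagreement is not about the price itself but about whether roughly $16 per server per month beats the staff time spent keeping a self-hosted setup current.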


In my experience running nagios, keeping your monitoring system effective is never as easy as set-and-forget. Even more so when you're growing or scaling and systems are changing more regularly. To me that $11,388 per year is cheap compared to the cost of having a salaried employee spend more time on a slower-to-implement solution.

If you don't have a full time sys admin it's also one of those things that is very easy to put off or delay or just not keep on top of.

As for the data loss, I can't think of a time I've ever wanted to see resource statistics from over a year ago, and downtimes and outages are kept in a problem tracking system.


One thing I've done in the past to help prevent 'configuration rot' of nagios configs is to hook it into the same files that drove our deployments - when the deployment changes, the nagios config changes with it. Nagios supports template-based configs, so once the tests are developed it isn't too hard to write some automation that spits out a config file. This worked pretty well for a site with ~350 servers and devices (switches, APC rack PDUs) with ~3000 tests.
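A minimal Python sketch of that kind of generator, under stated assumptions: the manifest shape (role mapped to hostname/address pairs) and the `ROLE_CHECKS` table are hypothetical stand-ins for whatever files drive your deployments; `generic-host` and `generic-service` are template names commonly found in stock Nagios configs; and `check_http`/`check_mysql` are assumed to already be defined as commands.

```python
from typing import Dict, List, Tuple

HOST_TEMPLATE = """define host {{
    use        {template}
    host_name  {name}
    address    {address}
}}
"""

SERVICE_TEMPLATE = """define service {{
    use                  generic-service
    host_name            {name}
    service_description  {desc}
    check_command        {command}
}}
"""

# Hypothetical mapping from deployment role to the checks every host in that role gets.
ROLE_CHECKS: Dict[str, List[Tuple[str, str]]] = {
    "web": [("HTTP", "check_http")],
    "db": [("MySQL", "check_mysql")],
}


def render_nagios_config(manifest: Dict[str, List[Tuple[str, str]]]) -> str:
    """Turn {role: [(hostname, address), ...]} into Nagios object definitions."""
    parts = []
    for role, hosts in sorted(manifest.items()):
        for name, address in hosts:
            parts.append(HOST_TEMPLATE.format(template="generic-host", name=name, address=address))
            # Every host inherits the checks attached to its deployment role.
            for desc, command in ROLE_CHECKS.get(role, []):
                parts.append(SERVICE_TEMPLATE.format(name=name, desc=desc, command=command))
    return "\n".join(parts)
```

Because the output is derived from the deployment data, adding a host to a role adds its checks on the next run, which is what keeps the config from rotting.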


http://scoutapp.com again saved my bacon yesterday when my God monitoring script for DelayedJob workers decided to pull an Ark with my processes ("let's have two of everything!") and ran out of swap, bringing the server to a virtual standstill. Scout sent me an email, the email rang my cell phone, and I was able to recover before it ruined my best day of sales ever.


We actually just talked about the monitoring stack we use at Scout. There isn't one do-it-all tool (if there is, it'd be pretty ugly).

Scout, Monit, Hoptoad, New Relic, and Pingdom:

http://blog.scoutapp.com/articles/2010/10/19/monitor-rails-c...


Nagios is great for alerts and problem monitoring. Cacti is great for visualizing performance over time. Both take a bit of time and effort to set up on most popular Linux distros, but are well worth it. I'm running them on a base-level Rackspace cloud server, which costs about $20/mo.

For a reasonably priced hosted solution, check out cloudkick.com. It is also very good and lets you monitor EC2, Rackspace cloud servers, and your own boxes with an agent installed.


I use munin to monitor http://codeboff.in. The stats page is up at http://codeboff.in:8080/ if you want to take a look.

Also I use Supervisor (http://supervisord.org/) to launch all my processes and it is configured to send me an email if any of them grows beyond a memory threshold and/or crashes (in which case it also attempts to restart it a few times).
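Supervisor handles the crash-and-restart part itself through its program settings (e.g. `autorestart`, `startretries`). The Python sketch below only illustrates that retry logic and is not Supervisor's code. The hypothetical `run_with_restarts` reruns a command while it exits non-zero, up to a limit, and prints where a real supervisor would send the email. The memory-threshold check is not covered.

```python
import subprocess
import sys
import time
from typing import List


def run_with_restarts(cmd: List[str], max_restarts: int = 3, delay_s: float = 0.0) -> int:
    """Run cmd, restarting it after a non-zero exit, at most max_restarts times.

    Returns the number of restarts performed.
    """
    restarts = 0
    while True:
        code = subprocess.call(cmd)
        if code == 0:
            return restarts
        if restarts >= max_restarts:
            # A real supervisor would send the "gave up" email here.
            print(f"giving up after {restarts} restarts (exit {code})", file=sys.stderr)
            return restarts
        restarts += 1
        print(f"process exited with {code}; restart #{restarts}", file=sys.stderr)
        time.sleep(delay_s)
```

Capping the retries matters: a process that crashes on startup would otherwise be restarted forever without anyone being told.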


Thanks for the link to supervisorctl - didn't know that one - and it looks pretty neat! :P


For alerts I am using nagios, as it has all kinds of checks for most services built in. Besides services, I also monitor security updates (any available security upgrades that are not installed) on Debian hosts and resource shortages in OpenVZ containers.
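Whatever gathers the numbers, a Nagios check reports its result through the standard plugin exit codes (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN) plus one line of text. A minimal Python sketch of the classification step, assuming the pending-update counts have already been obtained somehow. The hypothetical `security_update_status` function and its thresholds are illustrative; this is not how the stock `check_apt` plugin decides.

```python
from typing import Tuple

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def security_update_status(pending_security: int, pending_other: int) -> Tuple[int, str]:
    """Map pending-update counts to a (exit code, status line) pair."""
    if pending_security < 0 or pending_other < 0:
        return UNKNOWN, "UNKNOWN - invalid update counts"
    if pending_security > 0:
        return CRITICAL, f"CRITICAL - {pending_security} security update(s) pending"
    if pending_other > 0:
        return WARNING, f"WARNING - {pending_other} non-security update(s) pending"
    return OK, "OK - no updates pending"
```

A plugin wrapper would print the status line and call `sys.exit(code)` so Nagios picks up the state.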

I am also using monit for immediate actions like restarting the web server or checking some programs.

I manage both via puppet, which means that when I deploy a new host, the nagios configuration is adjusted automatically with the corresponding services.

For performance data there are several tools like cacti, munin or collectd.

- Munin, IIRC, has the problem that it gets slowish because it polls all the data from the hosts and then generates the charts for all of them. Only a problem if you have many hosts...

- Collectd is quite a nice and fast monitoring solution (it supports updates every second) and has advanced features like sending the data to multicast addresses. The only major drawback, IMHO, is that collectd doesn't ship with a good web UI, though many external UIs exist.

EDIT: I forgot to mention Icinga (http://www.icinga.org/), a fork of nagios started after some trademark problems, IIRC.

Monit and nagios are integrated with the ticketing system, e.g. a recovery message automatically closes the corresponding ticket.
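The open-on-problem, close-on-recovery behaviour can be sketched as a small state table keyed by (host, service). This is a hypothetical Python illustration; a real integration would call the ticketing system's own API where this sketch just hands out numbers.

```python
from typing import Dict, Tuple


class AlertTicketBridge:
    """Open one ticket per (host, service) problem; close it when the service recovers."""

    PROBLEM_STATES = ("WARNING", "CRITICAL", "UNKNOWN")

    def __init__(self) -> None:
        self.open_tickets: Dict[Tuple[str, str], int] = {}
        self._next_id = 1

    def handle(self, host: str, service: str, state: str) -> str:
        key = (host, service)
        if state in self.PROBLEM_STATES:
            if key in self.open_tickets:
                # Repeat notifications must not open duplicate tickets.
                return f"ticket #{self.open_tickets[key]} already open"
            self.open_tickets[key] = self._next_id
            self._next_id += 1
            return f"opened ticket #{self.open_tickets[key]}"
        if state == "OK":
            ticket = self.open_tickets.pop(key, None)
            return f"closed ticket #{ticket}" if ticket is not None else "nothing to close"
        raise ValueError(f"unknown state: {state}")
```

Keying on (host, service) is what lets a recovery notification find the right ticket without any human matching them up.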


Can you share the relevant snippets from your puppet setup? I'm currently looking at using puppet with nagios/monit to manage the 50+ servers that recently became my responsibility.


If you're starting with puppet, you should definitely have a look at the type reference http://docs.puppetlabs.com/references/stable/type.html It helps a lot to see and understand the types puppet already ships with.

For monit I am using the Monit pattern from [1]. Quite simple, but IWFM. Nagios is a little more complicated. If you have any questions, I've added my email address to my profile.

You need to use exported resources for the nagios part, which means you have to enable Stored Configurations on the puppet master [2]. For the server part I use:

  package { 'nagios3': ensure => present }

  service { 'nagios3':
    ensure  => running,
    enable  => true,
    require => Package['nagios3'],
  }

  # Make the generated config files readable by nagios, then restart it.
  exec { 'chmod_nagios':
    command     => '/bin/chmod 644 /etc/nagios3/conf.d/*',
    provider    => shell,   # the * glob needs a shell to expand
    refreshonly => true,
    notify      => Service['nagios3'],
  }

  file { '/etc/nagios3/conf.d': ensure => directory }

  # Collect the host/service definitions exported by every client.
  Nagios_host <<||>>    { notify => Exec['chmod_nagios'] }
  Nagios_service <<||>> { notify => Exec['chmod_nagios'] }


The client setup is quite easy; I'm using nagios-nrpe for the checks. You have to deploy your own configuration files; I've omitted them here.

  package { 'nagios-nrpe-server': ensure => installed }

  service { 'nagios-nrpe-server':
    ensure  => running,
    require => Package['nagios-nrpe-server'],
    pattern => '/usr/sbin/nrpe',   # used to find the process if the init script lacks a status command
  }

  # Exported (@@) resources are stored on the master and realized there
  # by the Nagios_host / Nagios_service collectors in the server manifest.
  @@nagios_host { "host_${hostname}":
    ensure    => present,
    address   => $ipaddress,
    host_name => $hostname,
    use       => 'generic-host',
    target    => "/etc/nagios3/conf.d/host-${hostname}.cfg",
  }

  @@nagios_service { "apt-${hostname}":
    ensure              => present,
    use                 => 'generic-service',
    host_name           => $hostname,
    service_description => 'apt check',
    target              => '/etc/nagios3/conf.d/apt.cfg',
    check_command       => 'check_nrpe_1arg!check_apt',
  }



[1] http://projects.puppetlabs.com/projects/1/wiki/Monit_Pattern...

[2] http://projects.puppetlabs.com/projects/puppet/wiki/Using_St... You should not use sqlite; use mysql if you want to use the dashboard.

  [puppetmasterd]
   storeconfigs = true
   dbadapter = sqlite3
   dblocation = /var/lib/puppet/storeconfigs.sqlite


I've tried many packages, have settled on Nagios + Cacti. Featureful, rock solid, free, scales well, documentation galore. How can you beat that? A live example -- Wikipedia uses Nagios for their monitoring solution: http://nagios.wikimedia.org/nagios/cgi-bin/status.cgi?host=a... , they monitor more than 2,000 services on 426 hosts. At my college, Nagios has an icon for each host that links to a corresponding wiki page.


We prefer to just let somebody else manage it: ServerDensity is the best we have found. You can also use PagerDuty in tandem with ServerDensity.


New Relic, Nagios, Cacti, Pingdom, a few custom alerts/stat graphs (using flot), and PagerDuty for alerts and on-call handling across all of them.

Also working on integrating Splunk for analyzing log files and alerting on a few aspects.


Munin for stats, nagios for reports / actions, monit for keeping things running in general. I don't use monit for resource management, since it's just not complex enough for some of the stuff I need to check.

I tried cacti once and god it's bad... It works, but the lack of a simple overview of how it works / how it's supposed to be configured is painfully obvious.

I'm keeping an eye on http://www.shinken-monitoring.org/ - might be worth checking out in the future if I ever get into large-scale-nagios problems.


custom network management system that does icmp polling and service checks, snmp polling for bandwidth, cpu, memory, and server-specific data that is exported server-side (a lot of it is gathered from log files) and collected through snmp.
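Bandwidth via SNMP is usually derived from two readings of an interface octet counter such as IF-MIB's `ifInOctets`, which is a 32-bit counter and wraps on busy links (`ifHCInOctets` is the 64-bit variant). A minimal Python sketch of the rate calculation; the function name is hypothetical, and it assumes at most one wrap between polls.

```python
def octets_to_bps(prev: int, curr: int, interval_s: float, counter_bits: int = 32) -> float:
    """Convert two readings of an SNMP octet counter into bits per second.

    Modular arithmetic absorbs a single counter wrap between polls; it cannot
    detect multiple wraps or a counter reset after a device reboot.
    """
    if interval_s <= 0:
        raise ValueError("interval must be positive")
    delta = (curr - prev) % (2 ** counter_bits)
    return delta * 8 / interval_s
```

Polling often enough that a 32-bit counter cannot wrap twice between samples (or using the 64-bit counters where available) keeps the single-wrap assumption valid.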

alerts through jabber, email, and sms. i set up a monitor and a cheapo eeebox to watch the data in realtime for my most important stuff.

http://www.flickr.com/photos/symmetricalism/4435593589/


I've been very happy with Scout (http://scoutapp.com/) for monitoring. It's a nice mix of standard monitoring and a pretty simple plugin system for custom monitoring.

All of the plugins are written in Ruby and the interface for reporting data is pretty simple.

I'm using it for everything from monitoring how many KB/s systems are swapping (alerting if it's anything measurable) to custom things like job queue depths.
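On Linux, swap activity can be measured from the `pswpin`/`pswpout` counters in `/proc/vmstat`, which count pages swapped in and out since boot. A minimal Python sketch, not a Scout plugin (those are written in Ruby): sample the counters twice and convert the page delta to KB/s using the system page size.

```python
import os
import time
from typing import Dict


def parse_vmstat(text: str) -> Dict[str, int]:
    """Parse /proc/vmstat-style 'name value' lines into a dict."""
    stats = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1].isdigit():
            stats[parts[0]] = int(parts[1])
    return stats


def swap_kb_per_second(before: Dict[str, int], after: Dict[str, int],
                       interval_s: float, page_size_bytes: int) -> float:
    """KB/s swapped in plus out between two vmstat samples."""
    pages = (after["pswpin"] - before["pswpin"]) + (after["pswpout"] - before["pswpout"])
    return pages * page_size_bytes / 1024 / interval_s


def sample_swap_rate(interval_s: float = 5.0) -> float:
    """Linux only: take two samples interval_s apart."""
    with open("/proc/vmstat") as f:
        before = parse_vmstat(f.read())
    time.sleep(interval_s)
    with open("/proc/vmstat") as f:
        after = parse_vmstat(f.read())
    return swap_kb_per_second(before, after, interval_s, os.sysconf("SC_PAGE_SIZE"))
```

An "alert if it's anything measurable" rule then reduces to `sample_swap_rate() > 0`, or a small threshold to ignore noise.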


We use monit combined with a number of helper bash scripts running in cron. For example, we have a cron job that does integrity checks on sqlite databases and writes to a file if it encounters corruption. Monit is set to monitor that file for size change and sends an alert if anything is written to it. That kind of thing is a bit of a hack, but seems to be the way to do it if you are using monit.
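A sketch of what such a cron job could look like in Python, assuming a report file that Monit watches (the paths and the `check_sqlite` name are hypothetical). It uses SQLite's `PRAGMA integrity_check`, which returns a single row containing `ok` when no problems are found. Opening the database with a `mode=rw` URI makes a missing file an error instead of silently creating an empty database.

```python
import sqlite3
import time
from pathlib import Path


def check_sqlite(db_path: str, report_path: str) -> bool:
    """Run PRAGMA integrity_check; append any problems to report_path.

    Returns True if the database is OK. Raises sqlite3.OperationalError if
    the database file cannot be opened.
    """
    uri = Path(db_path).resolve().as_uri() + "?mode=rw"   # rw: don't create a new file
    conn = sqlite3.connect(uri, uri=True)
    try:
        rows = [row[0] for row in conn.execute("PRAGMA integrity_check")]
    finally:
        conn.close()
    if rows == ["ok"]:
        return True
    # Any write here changes the file size, which is what Monit is watching for.
    stamp = time.strftime("%Y-%m-%d %H:%M:%S")
    with open(report_path, "a") as report:
        for message in rows:
            report.write(f"{stamp} {db_path}: {message}\n")
    return False
```

Monit then only needs to watch the report file and alert when its size changes.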

We also use pingdom for external uptime monitoring.


You should check out Rigor for performance and uptime monitoring of your web servers. It's a subscription service that goes a step beyond the basic cpu/process details and lets you monitor the accessibility of individual pages and transactions. They have a free trial that takes two minutes to set up at http://rigor.com.


STACK: MySQL, RoR, Nginx, Mongrel, Monit Fun. We use:

- Newrelic

- Engineyard: http monitor

- Hoptoad for errors

- Some custom daemon scripts that fire emails when background processes aren't working.
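One common shape for that last kind of script is a heartbeat file: each background worker touches a file as it makes progress, and the watcher emails when the file is missing or too old. A minimal Python sketch under that assumption. The heartbeat convention, function names, and sender address are hypothetical.

```python
import os
import smtplib
import time
from email.message import EmailMessage
from typing import Optional


def heartbeat_is_stale(path: str, max_age_s: float, now: Optional[float] = None) -> bool:
    """True if the worker's heartbeat file is missing or older than max_age_s."""
    try:
        mtime = os.path.getmtime(path)
    except FileNotFoundError:
        return True
    if now is None:
        now = time.time()
    return now - mtime > max_age_s


def send_alert(subject: str, body: str, to_addr: str, smtp_host: str = "localhost") -> None:
    """Email an alert through a local SMTP server (assumed to exist)."""
    msg = EmailMessage()
    msg["Subject"] = subject
    msg["From"] = "monitor@localhost"  # hypothetical sender address
    msg["To"] = to_addr
    msg.set_content(body)
    with smtplib.SMTP(smtp_host) as smtp:
        smtp.send_message(msg)
```

Run from cron, the check is `if heartbeat_is_stale(path, 300): send_alert(...)`. A stale heartbeat catches workers that are alive but stuck, which a plain process check would miss.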


I'm definitely biased, but I use the product that I develop, Netmon -- www.netmon.ca. It may be worth checking out for you.


Collectd + Visage (http://github.com/auxesis/visage)

Monit + Monit Aggregator (http://github.com/mattfawcett/monit-aggregator)

I've hacked on both a bit, but they're great starts for small clusters.


Logcheck. This lets mildly-important things annoy you until you fix them. Lots of "failed login for root" every day, so I changed my ssh server to only allow logins as me. Lots of authentication failures for SMTP AUTH, so I enabled fail2ban. Now I don't get any annoying emails, and my box is slightly more secure.
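Logcheck itself works from pattern files that filter log lines before mailing what remains. The Python sketch below only shows how the "failed login for root" signal can be pulled out of an auth log. It assumes the typical OpenSSH line format (`Failed password for [invalid user] NAME from ADDR`), which can vary by version and distro.

```python
import re
from collections import Counter
from typing import Iterable

# Typical OpenSSH auth-log wording; adjust for your sshd version.
FAILED_RE = re.compile(r"Failed password for (?:invalid user )?(\S+) from (\S+)")


def failed_logins_by_user(lines: Iterable[str]) -> Counter:
    """Count failed password attempts per target username."""
    counts: Counter = Counter()
    for line in lines:
        match = FAILED_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts
```

A large count for `root` is exactly the noise that restricting ssh logins to named users makes disappear.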


Opsview. http://www.opsview.com/

It's built on top of Nagios, and makes the entire process very easy and straightforward to setup and run.

Edit: The community version is free, and I haven't run into any limitations on it. The paid version offers a few extra modules.


you can use Cactus to monitor server load/cpu/etc, I'm not sure about the others...


This is the program/code alexweber is referring to:

http://www.cacti.net/


Shocked at the lack of zabbix. Alerts, graphs, history, all in one place. Install the agent and you get the basics for free - I just set up a demo of it a couple days ago in 30 minutes flat.


Me too. This is what I use. It was dead simple to set up. It had great default settings. pfSense also has an agent package available for it, and it's a one-click setup. It was also very easy to set up custom scripts for monitoring my custom back-end software. I tried nagios, hobbit, opennms, and others recently, and zabbix worked best for me.


I've been using zabbix for around 4 years now for all my monitoring needs, and I love it. In fact, I love it so much that I decided to launch a hosted 'zabbix as a service' company this year. Check it out: http://tribily.com


For security monitoring: OSSEC (open source at ossec.net) and http://sucuri.net (paid external mon)


Munin/monit


Xymon (formerly known as Big Brother). Yeah, it's old, but it's simple, reliable, and what I know. I've tried to switch to Nagios and OpManager, but ended up going back to Xymon.

I use Cacti for graphing. I have some big gripes with it, but I've invested a lot of brain matter in it and it works.

http://xymon.com

http://cacti.net

http://www.opmanager.com (slow, unreliable j2ee app with a lot of features and great tech support.)


I use pingdom.com for external monitoring, especially for response time reports.


Munin


Nagios


Nagios is probably the best free option, although configuration can be pretty complicated. I like AppFirst, it does most of the stuff you want out of the box, and you can use any Nagios plugin to add specific functionality for whatever else you want.


host-tracker.com



