
I'm not experienced with the CollectD stack, but I use Prometheus + Grafana to monitor probes. My two cents:

- Fairly lightweight. Prometheus deals with quite a lot of series without much memory or CPU usage.

- Integration with a lot of applications. Prometheus lets me monitor not only the system, but other applications such as Elastic, Nginx, PostgreSQL, network drivers... Sometimes I need an extra exporter, but they tend to be very light on resources. Also, with mtail (which is again super lightweight) I can convert logs to metrics with simple regexes.

- Number of metrics. Several times, while diagnosing an outage, I needed a metric I hadn't thought about, and it turned out that the exporter I was using was already collecting it; I just hadn't included it in the dashboard. As an example, the default node exporter has very detailed I/O metrics, systemd collectors, network metrics... They're quite useful.

- Metric correctness. Prometheus appears to be at least decent at dealing with rate calculations and counter resets. Other monitoring systems are worse at this, and it wasn't unusual to see a 20000% CPU usage alert caused by a counter reset.

- Alerts. Prometheus can generate alerts with quite a lot of flexibility, and Alertmanager is a pretty nice router for those alerts (e.g., I can receive all alerts in a separate mail, while critical alerts are also sent to a Slack channel).

- Community support. It seems the community is adopting the Prometheus format for exposing metrics, and there are packages for Python, Go and probably more languages. Also, the people who make the exporters tend to also make dashboards, so you almost always have a starting point that you can fine-tune later.

- Ease of setup. It's just YAML files. I have an Ansible role for automation, but you can get by with installing one or two packages on the clients and adding a line to a configuration file on the Prometheus master node.

- Ease of use. It's incredibly easy to make new graphs and dashboards with Prometheus and Grafana, no matter if they're simple or complex.
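
To illustrate the counter-reset point above, here is a minimal Python sketch of how a reset-aware increase can be computed from raw counter samples. It is a simplification and not Prometheus's actual implementation (which, among other things, also extrapolates to the window boundaries); the function names `counter_increase` and `naive_increase` and the sample values are made up for the example.

```python
from typing import Sequence


def counter_increase(samples: Sequence[float]) -> float:
    """Total increase of a counter over a window, tolerating resets.

    Assumption for this sketch: a counter only goes backwards when the
    process restarted and the counter began again from zero.
    """
    total = 0.0
    for prev, curr in zip(samples, samples[1:]):
        if curr >= prev:
            # Normal case: the counter moved forward.
            total += curr - prev
        else:
            # Reset detected: everything counted since the restart is `curr`.
            total += curr
    return total


def naive_increase(samples: Sequence[float]) -> float:
    """Last minus first, which ignores resets and can go negative."""
    return samples[-1] - samples[0]


# Hypothetical samples: the counter resets between 200 and 10.
samples = [100.0, 150.0, 200.0, 10.0, 60.0]
print(counter_increase(samples))  # 50 + 50 + 10 + 50 = 160.0
print(naive_increase(samples))    # 60 - 100 = -40.0
```

A system that uses the naive difference ends up with a negative or wrapped-around value after a restart, which is one plausible way to get absurd readings like the CPU alert mentioned above.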

For me, the main points that make me use Prometheus (or any other monitoring setup beyond simple scripts) are alerting and the number of metrics. If you just need to monitor CPU load and simple stats, maybe Prometheus is too much, but it's not that hard to set up anyway.




Author here. I'll probably write another tutorial focusing on Prometheus, instead of CollectD.

Thanks for the suggestion.

SerHack


It would be wonderful if you included limitations as well, to help people make the right decisions for their tech stack. I've been playing around with Prometheus lately for environmental monitoring, and long-term retention is particularly important to me.

During proof-of-concept testing, some historical data on disk perhaps wasn't lost per se, but definitely failed to load on restart. I haven't worked hard to replicate this but there are some similar unsolved tickets out there.

Additional traps for new players include customizing --storage.tsdb.retention.time and related parameters.
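
For anyone planning long retention, a back-of-the-envelope sketch in Python of the disk sizing rule of thumb from the Prometheus storage documentation (retention in seconds × ingested samples per second × bytes per sample, with roughly 1 to 2 bytes per sample on average). The function name `estimate_tsdb_bytes` and the workload figures are assumptions for illustration, not measurements from my setup.

```python
def estimate_tsdb_bytes(
    retention_days: float,
    samples_per_second: float,
    bytes_per_sample: float = 2.0,
) -> float:
    """Rough on-disk size for a given retention period.

    bytes_per_sample defaults to the upper end of the 1-2 byte
    rule of thumb; real usage depends on how well the data compresses.
    """
    retention_seconds = retention_days * 24 * 60 * 60
    return retention_seconds * samples_per_second * bytes_per_sample


# Hypothetical workload: 1500 series scraped every 15 s = 100 samples/s.
samples_per_second = 1500 / 15
one_year = estimate_tsdb_bytes(365, samples_per_second)
print(f"{one_year / 1e9:.1f} GB for one year")  # 6.3 GB for one year
```

The estimate only helps with choosing a value for --storage.tsdb.retention.time (or the size-based --storage.tsdb.retention.size); it says nothing about the failed-to-load issue described above.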


Thank you!



