I found the software in this stack to be very bloated and difficult to maintain. Large, complicated software has a tendency to fall flat on its face when something goes wrong, and this is a domain where reliability is paramount. I wrote about a different approach (Prometheus + Alertmanager) for sourcehut, if you're curious:
I have a bit of a love/hate relationship with Prometheus. At home I really like it; it was simple to set up for my needs, and most of my configuration lives on my server, which then scrapes other machines for the data. However, I find it quite frustrating at scale for work, both in its concepts (it's hard to describe but it's sort of...backwards?) and in its query performance, although that might be a side effect of using it with Grafana and of me attempting to misuse it. By contrast, I think the concepts of something like TimescaleDB are easier to understand when it comes to scaling and optimising that service.
In my previous job I had a very clear use-case for not using Prometheus and did for a while use InfluxDB (it involved devices sending data from behind firewalls across many sites). I found it pretty expensive to scale and it fell over when it ran out of storage, which feels like something that should have been handled automatically considering it was a PaaS offering.
One point of note for SourceHut's Prometheus use is that we generally don't make dashboards. I don't really like Grafana. I will sometimes use gnuplot with styx to plot graphs on an as-needed basis:
I have a similar relationship with Grafana as I do with Prometheus: I love it at home, where I've got some very useful graphs for my home network, but it's almost unusable for work because it slows down the moment you start adding more graphs. Again, it's probably due to my lack of knowledge of the Prometheus functions for reducing the amount of data returned, but it would be nice if it could handle some of that automatically rather than just grinding to a halt.
On a basic level, yes, but I often just use it as a starting point for more complex gnuplot graphs, or different kinds of visualizations - box plots, histograms, etc.
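For anyone who wants to try the gnuplot route, here is a rough sketch of turning a Prometheus range-query response into whitespace-separated columns that gnuplot can read. To be clear, this is not how styx does it; the function name `matrix_to_columns` is made up for illustration, and the only thing assumed is the JSON shape that Prometheus's `/api/v1/query_range` endpoint returns.

```python
import json


def matrix_to_columns(response_text):
    """Convert a query_range JSON response into gnuplot-style data blocks.

    Hypothetical helper, not part of any tool mentioned in this thread.
    """
    doc = json.loads(response_text)
    if doc.get("status") != "success" or doc["data"]["resultType"] != "matrix":
        raise ValueError("expected a successful range-query (matrix) response")

    blocks = []
    for series in doc["data"]["result"]:
        # Put the series labels in a comment line; gnuplot skips lines starting with '#'.
        labels = ",".join(f"{k}={v}" for k, v in sorted(series["metric"].items()))
        lines = [f"# {labels}"]
        # Each sample is [unix_timestamp, "value-as-string"].
        lines += [f"{int(float(ts))} {value}" for ts, value in series["values"]]
        blocks.append("\n".join(lines))

    # Two blank lines between blocks let gnuplot address each series via "index".
    return "\n\n\n".join(blocks) + "\n"
```

Writing the result to a file and plotting it with `using 1:2` is then a starting point for box plots, histograms and so on.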
I guess the https://github.com/prometheus/pushgateway could help with that? As for the query performance, there are a lot of things you can do with recording rules that might help a lot with speeding up dashboards or queries.
Yeah, the pushgateway was the alternative to using InfluxDB. In the end we actually used Datadog for it, despite the cost, as it was just easier to scale on it (we had hundreds of devices per site). The pushgateway route with Prometheus just ended up feeling like there were too many things relying on each other, i.e. Prometheus -> Push Gateway <- Multiple agents on each device is inherently more complex than just connecting directly to a DB/service from the device.
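For reference, the device side of that Prometheus -> Push Gateway <- agents arrangement can be sketched roughly as below. This is only an illustration under assumptions: the helper names, the job and instance values, and the `http://localhost:9091` address are made up. What it relies on is that the Pushgateway accepts the Prometheus text format on `/metrics/job/<job>/instance/<instance>`, with PUT replacing the metrics in that group.

```python
import urllib.request
from urllib.parse import quote


def pushgateway_url(base, job, instance):
    # The grouping key is encoded in the path. Values containing "/" need the
    # Pushgateway's base64 path form, which this sketch does not implement.
    for value in (job, instance):
        if "/" in value:
            raise ValueError("label values with '/' are not handled in this sketch")
    return f"{base.rstrip('/')}/metrics/job/{quote(job, safe='')}/instance/{quote(instance, safe='')}"


def render_metrics(samples):
    # Prometheus text exposition format: "name value", one sample per line,
    # and the body must end with a newline.
    return "".join(f"{name} {value}\n" for name, value in sorted(samples.items()))


def build_push(base, job, instance, samples):
    # Build the request without sending it, so it can be inspected or retried.
    return urllib.request.Request(
        pushgateway_url(base, job, instance),
        data=render_metrics(samples).encode("utf-8"),
        method="PUT",  # PUT replaces every metric in this job/instance group
        headers={"Content-Type": "text/plain; version=0.0.4"},
    )


def push(base, job, instance, samples, timeout=5):
    # Sends the request; raises urllib.error.URLError if the gateway is unreachable.
    with urllib.request.urlopen(build_push(base, job, instance, samples), timeout=timeout) as resp:
        return resp.status
```

A device agent would then call something like `push("http://localhost:9091", "site_sensor", "device-17", {"temperature_celsius": 21.5})` on a timer, and Prometheus scrapes the gateway. That is still three moving parts, which is the complexity being described above.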
Try VictoriaMetrics next time. It supports data push via multiple popular data ingestion protocols [1] and it provides a Prometheus-compatible API for Grafana [2].
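As a rough idea of what pushing directly from a device could look like, here is a sketch that formats one sample in InfluxDB line protocol and builds a POST to a `/write` endpoint, which is one of the ingestion paths VictoriaMetrics accepts. The helper names, the measurement and tag values, and the `http://localhost:8428` address are assumptions for illustration; only float fields are handled, and no timestamp is sent, on the assumption that the server assigns one on arrival.

```python
import urllib.request


def _esc(text, chars):
    # Line protocol escapes special characters with a backslash.
    for ch in chars:
        text = text.replace(ch, "\\" + ch)
    return text


def influx_line(measurement, tags, fields):
    # measurement[,tag=value...] field=value[,field=value...]
    head = _esc(measurement, ", ")
    tag_part = "".join(
        f",{_esc(k, ', =')}={_esc(v, ', =')}" for k, v in sorted(tags.items())
    )
    # Only float fields are supported in this sketch.
    field_part = ",".join(
        f"{_esc(k, ', =')}={float(v)}" for k, v in sorted(fields.items())
    )
    return f"{head}{tag_part} {field_part}"


def build_write(base, lines):
    # Build (but do not send) a POST carrying newline-separated samples.
    body = ("\n".join(lines) + "\n").encode("utf-8")
    return urllib.request.Request(
        f"{base.rstrip('/')}/write", data=body, method="POST"
    )
```

Sending it is then `urllib.request.urlopen(build_write("http://localhost:8428", [influx_line("env", {"site": "berlin"}, {"temp": 21.5})]))`, with no gateway in between.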
https://sourcehut.org/blog/2020-07-03-how-we-monitor-our-ser...