A huge problem I'm having with graphite right now (which is making me look at in...

SEJeff · on Feb 4, 2015

We are in the final stages of the last 0.9.x release, 0.9.13. From then on, we're going to be making some more noticeable changes that break some backwards compat to make the project a lot more pleasant.

Note that 0.9.13 is almost ready to be cut: https://github.com/graphite-project/graphite-web/commit/7862...

https://github.com/graphite-project/carbon/commit/e69e1eb59a...

https://github.com/graphite-project/whisper/commit/19ab78ad6...

Anything in the master branch is what will be in 0.10.0 when we're ready to cut that. I think we'll spend some more cycles in 0.10.x focusing on non-carbon / non-whisper / non-ceres backends that should allow much better scalability. Some of these include cassandra, riak, etc.

For it timing out, it is a matter of general sysadmin spleunking to figure out what is wrong. It could be IO on your carbon caches, or CPU in your render servers (where it uses cairo). I'm a HUGE fan of grafana for doing 100% of the dashboards and only using graphite-web to spit out json, or alternatively to use graphite-api.

Take a look at the maxDataPoints argument to see if that will help your graphs to not timeout however.

mdeeks · on Feb 4, 2015

My brief experience with browser based rendering was not good. Our dashboard pages often have 40-50+ graphs for a single cluster. I found it brought all browsers to a crawl and turned our laptops into blazing infernos when viewing longer timelines. Granted I didn't try out graphana so it could have been related to badly optimized javascript in the ones I tried.

CPU on the render servers is low. IO on the carbon caches is acceptable (10k IOPS on SSDs that support up to 30k or so). If the CPU Usage Type graph would render it would show very little IO Wait (~5%). Graphs if you're interested: http://i.imgur.com/dCrDynY.png

Anyway thanks for the response. I'll keep digging. Looking forward to that 0.9.13 release!

SEJeff · on Feb 4, 2015

maxDataPoints was a feature added by the guy who wrote giraffe[1], which is for realtime dashboards from graphite. It was too slow until he added in the maxDataPoints feature, and now it is actually really awesome when setup properly.

Also look at graphite-api[2], written by a very active graphite committer. It is api only (only json), but absolutely awesome stuff. Hook it up to grafana for a real winner.

[1] http://giraffe.kenhub.com/#dashboard=Demo&timeFrame=1d

[2] https://github.com/brutasse/graphite-api

bbrazil · on Feb 4, 2015

For comparison I tried out >1k cpu plots in Prometheus on a m1.large with 2xHDDs. It took 20s with a cold cache.

mdeeks · on Feb 4, 2015

I'd be interested in hearing how it performs when rendering a huge page of graphs each with dozens to hundreds of graph lines.

Unfortunately though Prometheus lacks easy horizontal scaling just like Graphite. It sounds like Prometheus is worse actually since it mentions manual sharding rather than consistent hashing that Graphite does. This rules out Prometheus as an alternative to Graphite for me even if it does render complex graphs better. I'm definitely keeping my eye on this one though.

bbrazil · on Feb 4, 2015

> huge page of graphs each with dozens to hundreds of graph lines

From experience that much data on a page makes it quite difficult to comprehend, even for experts. I've seen hundreds of graphs on a single console, which was completely unusable. Another had ~15 graphs, but it took the (few) experts many minutes to interpret them as it was badly presented. A more aggregated form with fewer graphs tends to be easier to grok. See http://prometheus.io/docs/practices/consoles/ for suggestions on consoles that are easier to use.

> It sounds like Prometheus is worse actually since it mentions manual sharding rather than consistent hashing that Graphite does.

The manual sharding is vertical. That means that a single server would monitor the entire of a subsystem (for some possibly very broad definition of subsystem). This has the benefit that all the time series are in the same Prometheus server, you can use the query language to efficiently do arbitrary aggregation and other math for you to make the data easier to understand.

mdeeks · on Feb 4, 2015

It depends on the use case. For high level overview of systems you absolutely want fewer graphs. Agreed there. For deep dives into "Why is this server/cluster running like crap!?" having much more (all?) of the information right there to browse makes a big difference. I originally went for fewer graphs separated into multiple pages you had to navigate to and no one liked it. In the end we adopted both philosophies for each use case.

Lots of lines on a single graphs helps you notice imbalances you may not have noticed before. For example if a small subset of your cluster has lower CPU usage then you likely have a load balancing problem or something else weird going on.

RE: Sharding What happens when a single server can no longer hold the load of the subsystem? You have to shard that subsystem further by something random and arbitrary. It requires manual work to decide how to shard. Once you have too much data and too many servers that need monitoring, manual sharding becomes cumbersome. It's already cumbersome in Graphite since expanding a carbon-cache cluster requires moving data around since the hashing changes.

bbrazil · on Feb 4, 2015

> having much more (all?) of the information right there to browse makes a big difference.

I think it's important to have structure in your consoles so you can follow a logical debugging path, such as starting at the entry point of our queries, checking each backend, finding the problematic one, going to that backend's console and repeating until you find the culprit.

One approach console wise is to put the less important bits in a table rather than a graph, and potentially have further consoles if there were subsystems complex and interesting enough to justify it.

I'm used to services that expose thousands of metrics (and there's many more time series when labels are taken into account). Having everything on consoles with such rich instrumentation simply isn't workable, you have to focus on what's the most useful. At some point you're going to end up in the code, and from there see what metrics (and logs) that code exposes, ad-hoc graph them and debug from there.

> Lots of lines on a single graphs helps you notice imbalances you may not have noticed before. For example if a small subset of your cluster has lower CPU usage then you likely have a load balancing problem or something else weird going on.

Agreed, scatterplots are also pretty useful when analysing that sort of issue. Is it that the servers are efficient, or are they getting less load? A qps vs cpu scatterplot will tell you. To find such imbalances in the first place, taking a normalized standard deviation across all of your servers is handy - which is the sort of thing Prometheus is good at.

> You have to shard that subsystem further by something random and arbitrary.

One approach would be to have multiple prometheus servers with the same list of targets, and configured to do a consistent partition between them. You'd then need to do an extra aggregation step and get the data from the "slave" prometheus servers up to a "master" prometheus via federation. This is only likely to be a problem when you hit thousands of a single type of server, so the extra complexity tends to be manageable all things considered.