
> Any system has a 'natural set' of metrics

I'm trying to offer the perspective of someone who works with products that don't exist entirely on a server. If your product is a web service, the following might not apply to you.

IME creating diagnostic systems for various IoT and industrial devices, the "natural" stuff is relatively easy to implement (battery level, RSSI, connection state, etc) but it's rarely informative. In other words, it doesn't meaningfully correlate with the health of the system unless failure is already imminent.

It's the obscure stuff that tends to be informative (routing table state, delivery ratio, etc). But, complex metrics demand a greater engineering lift in their development and testing. There's also a non-trivial amount of effort involved in developing tools to interpret the resulting data.
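To make "delivery ratio" concrete, here's a rough sketch of how a receiver might estimate it from gaps in packet sequence numbers. Everything here is an assumption for illustration: the function name `delivery_ratio` is made up, and it assumes the sender stamps each packet with an incrementing sequence number that doesn't wrap around within the window being measured.

```python
def delivery_ratio(received_seqs):
    """Estimate the fraction of packets delivered in a window.

    received_seqs: sequence numbers seen by the receiver (hypothetical input).
    Assumes sequence numbers increment by 1 per packet and do not wrap.
    """
    unique = set(received_seqs)  # duplicates (retransmissions) count once
    if not unique:
        return None  # nothing received, so the ratio can't be estimated
    # Packets the sender must have emitted between the first and last one seen.
    expected = max(unique) - min(unique) + 1
    return len(unique) / expected


ratio = delivery_ratio([1, 2, 4, 5, 5, 8])
print(ratio)  # 5 distinct packets out of 8 expected -> 0.625
```

Even this toy version hints at the engineering lift: a real device has to deal with counter wraparound, reboots that reset the counter, and packets lost at the edges of the window, none of which are handled here.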

Even if natural and informative were tightly correlated, which they aren't, an informative metric isn't necessarily actionable. You have to be able to use the data to improve your product. I can't charge the battery in a customer's device for them. I also can't move their phone closer to a cell tower. If you can't act on a metric, you're just wasting your time.




Fine, but I'm now wondering what sort of "data" is going to help you "charge the battery in a customer's device for them [or] move their phone closer to a cell tower."

A natural metric for a distributed system is connectivity (or conversely partition detection). A metric on connectivity is informative. Can the information help you heal the partition? Maybe, maybe not. Time to hit the logs and see why the partition occurred and if an actionable remedy is possible.

(I'm trying to understand your pov btw, so clarify as you will.)


> I'm now wondering what sort of "data" is going to help you "charge the battery in a customer's device for them [or] move their phone closer to a cell tower."

None. The idea is that you have to think about what you'd actually do with that data once you've collected it. If it's something that far-fetched, it isn't worth collecting that data. (This philosophy is also convenient for GDPR reasons.)

Distributed systems are one place where metrics can be genuinely useful. They can be good at reducing the complexity of a bunch of interacting nodes down to something a bit more digestible. Distributed systems have their own fascinating technical challenges. One of the less-fascinating difficulties is that you're at the mercy of your client's IT. If they don't want their devices phoning home, you don't get real-time metrics. You might be able to store some stuff for offline diagnostic purposes, but other practical limits arise from there.

How do you detect partitions? You could have each device periodically record a snapshot of its routing table, but then if you wanted to identify the partition, you'd have to go fetch data from each node individually. So, maybe you have them share their routing tables with each other, thereby allowing the partition detection to happen on the fly. That's great, but now you're using precious, precious bandwidth hurling around diagnostic data that you might not even be able to access in practice. There's really no right answer here.
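For the "go fetch a snapshot from each node" approach, the offline analysis could look something like the sketch below. It's only an illustration under assumptions: the snapshot format (a dict of node id to directly reachable neighbor ids) and the function name `find_partitions` are hypothetical, and it treats a link as up if either side reports it.

```python
from collections import deque


def find_partitions(snapshots):
    """Group nodes into connected components from per-node neighbor lists.

    snapshots: dict mapping node id -> iterable of neighbor ids that node
    listed in its routing table snapshot (hypothetical format).
    """
    # Build an undirected graph: a link exists if either endpoint reports it.
    graph = {node: set() for node in snapshots}
    for node, neighbors in snapshots.items():
        for nb in neighbors:
            graph[node].add(nb)
            graph.setdefault(nb, set()).add(node)

    seen = set()
    partitions = []
    for start in sorted(graph):
        if start in seen:
            continue
        # Breadth-first search collects everything reachable from `start`.
        component = set()
        queue = deque([start])
        seen.add(start)
        while queue:
            cur = queue.popleft()
            component.add(cur)
            for nb in graph[cur]:
                if nb not in seen:
                    seen.add(nb)
                    queue.append(nb)
        partitions.append(sorted(component))
    return partitions


snapshots = {
    "a": ["b"],
    "b": ["a", "c"],
    "c": ["b"],
    "d": ["e"],
    "e": ["d"],
}
partitions = find_partitions(snapshots)
print(partitions)  # [['a', 'b', 'c'], ['d', 'e']]
```

More than one component means a partition. The catch is the one already mentioned: this only works if you can actually collect a snapshot from every node, and the snapshots have to be taken at roughly the same time to mean anything.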



