> Any system has a 'natural set' of metrics I'm trying to offer the perspective ...

temporarely · 2024-07-11T20:39:19.000000Z

Fine, but I'm now wondering what sort of "data" is going to help you "charge the battery in a customer's device for them [or] move their phone closer to a cell tower."

A natural metric for a distributed system is connectivity (or, conversely, partition detection). A metric on connectivity is informative. Can the information help you heal the partition? Maybe, maybe not. Time to hit the logs and see why the partition occurred and whether an actionable remedy is possible.

(I'm trying to understand your pov btw, so clarify as you will.)

ryukoposting · 2024-07-12T02:26:55.000000Z

> I'm now wondering what sort of "data" is going to help you "charge the battery in a customer's device for them [or] move their phone closer to a cell tower."

None. The idea is that you have to think about what you'd actually do with that data once you've collected it. If it's something that far-fetched, it isn't worth collecting that data. (This philosophy is also convenient for GDPR reasons.)

Distributed systems are one place where metrics can be genuinely useful. They can be good at reducing the complexity of a bunch of interacting nodes down to something a bit more digestible. Distributed systems have their own fascinating technical challenges. One of the less-fascinating difficulties is that you're at the mercy of your client's IT. If they don't want their devices phoning home, you don't get real-time metrics. You might be able to store some stuff for offline diagnostic purposes, but other practical limits arise from there.

How do you detect partitions? You could have each device periodically record a snapshot of its routing table, but then if you wanted to identify the partition, you'd have to go fetch data from each node individually. So, maybe you have them share their routing tables with each other, thereby allowing the partition detection to happen on the fly. That's great, but now you're using precious, precious bandwidth hurling around diagnostic data that you might not even be able to access in practice. There's really no right answer here.
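
To make the first option concrete, here's a rough sketch of the offline analysis once you've fetched every node's snapshot. Everything in it is an assumption for illustration: the `find_partitions` name is made up, each snapshot is reduced to a plain list of neighbor IDs, and links are treated as symmetric, which real routing tables don't guarantee. It just computes connected components, so each component is one partition.

```python
from collections import deque

def find_partitions(routing_tables):
    """Group nodes into partitions.

    routing_tables: hypothetical {node_id: [neighbor_id, ...]} mapping,
    one entry per snapshot you managed to collect.
    """
    # Assumption: links are undirected, so if A lists B, B can reach A.
    graph = {node: set() for node in routing_tables}
    for node, neighbors in routing_tables.items():
        for neighbor in neighbors:
            graph[node].add(neighbor)
            graph.setdefault(neighbor, set()).add(node)

    seen = set()
    partitions = []
    for start in graph:
        if start in seen:
            continue
        # Breadth-first search collects everything reachable from `start`.
        component = set()
        queue = deque([start])
        seen.add(start)
        while queue:
            current = queue.popleft()
            component.add(current)
            for neighbor in graph[current]:
                if neighbor not in seen:
                    seen.add(neighbor)
                    queue.append(neighbor)
        partitions.append(component)
    return partitions

snapshots = {
    "a": ["b"],
    "b": ["a", "c"],
    "c": ["b"],
    "d": ["e"],
    "e": ["d"],
}
partitions = find_partitions(snapshots)
```

More than one component means the network was split at snapshot time. The analysis is the cheap part; getting the snapshots off the devices is still the hard part.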