
> Most log messages are useless 99.99% of the time. Best likely outcome is that it's turned into a metric. The once-in-a-blue-moon outcome is that it tells you what went wrong when something crashed.

If it crashes, it's probably some scenario that was not properly handled. If it's not properly handled, it's also likely not properly logged. That's why you need verbose logs: once in a blue moon you need the ability to retrospectively investigate something that was not thought through at the time, without using a time machine.

This is more common in the financial world, where an audit trail is required to be kept long-term for regulatory reasons. Some auditor may ask you for proof that you ran a unit test for a function three years ago.

Every organization needs to find its own balance between storage cost and quality of observability. I prefer to keep as much data as we can financially afford. If Binance is happy to pay to store 100PB of logs, good for them!

"Do we absolutely need this data or not" is a very tough question. Instead, I usually ask "how long do we need to keep this data" and apply proper retention policy. That's a much easier question to answer for everyone.




It is quite unlikely that a regulator will ask you for proof that you have a unit test for anything (also, that's not what a unit test is; see [1] for a good summary of why).

It _is_ likely that a regulator will ask you to prove you are developing within the quality assurance framework you have claimed to follow, though.

Finally, though, logs are not an audit trail, and almost no one can prove their logs are correct with respect to the state of the system at any given time.

[1]: https://www.youtube.com/watch?v=EZ05e7EMOLM


> If it's not properly handled, it's also likely not properly logged

Then your blue-moon probability of it being useful drops rapidly. Verbose logs are simply a pain in the arse unless you have a massive processing system, and even then they either kneecap your observation window or make your queries take ages.

I am lucky enough to work at a place that has really ace logging capability, but, and I cannot stress this enough, it is colossally expensive. Literal billions.

But logging is not an audit trail. Even here, where we have fancy PII shields and the like, logging doesn't have the SLA to record anything critical. If there is a capacity crunch, logging resolution gets turned down. Plus, logging anything of value to the system gets you a significant bollocking.

If you need something you can hand to a government investigator and you're pulling logs, you're already in deep shit. An audit framework needs a very high SLA, incredible durability, and strong authentication for both people and services. All three of those are generally foreign to logging systems.

Logging is useful and you should log things, but you should not use it as a way to generate metrics. Verbose logs are just a really efficient way to burn through your infrastructure budget.


> Verbose logs are simply a pain in the arse unless you have a massive processing system, and even then they either kneecap your observation window or make your queries take ages.

Which is why this blog post brags about their capability. Technology advances, and something difficult to do today may not be as difficult tomorrow. If your logging infra is overwhelmed, by all means drop some data and protect the system. But if Binance is happily storing and querying their 100PB of logs now, that's their choice, and it's totally fine; I won't say they are doing anything wrong. Again, we are talking about blue-moon scenarios here, which is all about hedging risks and uncertainties. It's fine if Netflix drops a few frames of a movie, but my bank can't drop my transaction.


How about only saving the verbose logs if there's an error?


Yup, nice idea: keep collecting logs in a flow and only emit them when there is an error. Or:

Start logging into a buffer and only flush it when there is an error.
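
A minimal sketch of the buffered variant, using Python's stdlib logging.handlers.MemoryHandler, which holds records in memory and flushes them to a target handler when a record at or above flushLevel arrives (or when the buffer fills); the capacity and handler choice are illustrative:

    import logging
    import logging.handlers

    # Final destination for flushed records (could equally be a file or network handler).
    target = logging.StreamHandler()
    target.setFormatter(logging.Formatter("%(asctime)s %(levelname)s %(message)s"))

    # Hold up to 10,000 records in memory; dump them all to `target`
    # only when something at ERROR level or above is logged.
    buffered = logging.handlers.MemoryHandler(
        capacity=10_000,
        flushLevel=logging.ERROR,
        target=target,
    )

    logger = logging.getLogger("app")
    logger.setLevel(logging.DEBUG)
    logger.addHandler(buffered)

    logger.debug("held in memory, not written anywhere yet")
    logger.error("boom")  # flushes the buffered DEBUG record along with this one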


I think this works well if you think about sampling traces, not logs.

Basically, every log message should be attached to a trace. Then you might choose to throw away trace data based on criteria, e.g. throw away 98% of "successful" traces and 0% of "error" traces.

The (admittedly not particularly hard) challenge then is building the infra that knows how to essentially make one buffer per trace, and keep/discard collections of related logs as required.
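
A rough sketch of that, with hypothetical names (TraceBufferingSampler, ship) just to show the shape of the keep/discard decision:

    import random
    from collections import defaultdict

    def ship(lines):
        # Hypothetical: forward a kept trace's logs to long-term storage.
        for line in lines:
            print(line)

    class TraceBufferingSampler:
        """Buffer log lines per trace; decide keep/discard when the trace ends."""

        def __init__(self, ok_keep_rate=0.02):
            self.ok_keep_rate = ok_keep_rate   # keep ~2% of successful traces
            self.buffers = defaultdict(list)   # trace_id -> buffered log lines

        def log(self, trace_id, line):
            self.buffers[trace_id].append(line)

        def end_trace(self, trace_id, had_error):
            lines = self.buffers.pop(trace_id, [])
            # Keep 100% of error traces and a small random sample of the rest.
            if had_error or random.random() < self.ok_keep_rate:
                ship(lines)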


It sounds nice, but also consider: 1) depending on how your app crashes, are you sure the buffer will be flushed, and 2) if logging is expensive from a performance perspective, your base performance profile may be operating under the assumption that you’re humming along not logging anything. Some errors may beget more errors and have a snowball effect.
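
On point 1, one partial mitigation (a sketch, not a guarantee) is to hook the interpreter's crash path so buffers are flushed on the way down; it covers unhandled exceptions but not SIGKILL, OOM kills, or hard crashes:

    import logging
    import sys

    logger = logging.getLogger("app")

    def flush_on_crash(exc_type, exc_value, tb):
        # Logging at ERROR trips a MemoryHandler's flushLevel (as in the sketch
        # above), so buffered context gets written out before the process exits.
        logger.error("unhandled exception", exc_info=(exc_type, exc_value, tb))
        for handler in logger.handlers:
            handler.flush()
        sys.__excepthook__(exc_type, exc_value, tb)

    # Only fires for unhandled exceptions in the main thread.
    sys.excepthook = flush_on_crash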


Both are solved by having a sidecar (think of it as a local ingestion point) that records everything (no waiting for a flush on error) and then does tail sampling on the spans whose status is non-OK, i.e. everything that's non-OK gets sent to Datadog, Baselime, your Grafana setup, or your custom ClickHouse 100PB storage nodes. Or take your pick of any of 1000+ OpenTelemetry-compatible providers. https://opentelemetry.io/docs/concepts/sampling/#tail-sampli...

Pattern is the ~same.
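
For reference, a sketch of what that policy looks like in an OpenTelemetry Collector config, assuming the contrib tail_sampling processor; the policy names and the 2% figure are illustrative:

    processors:
      tail_sampling:
        decision_wait: 10s            # how long to wait per trace before deciding
        policies:
          - name: keep-all-errors
            type: status_code
            status_code:
              status_codes: [ERROR]
          - name: sample-ok-traffic
            type: probabilistic
            probabilistic:
              sampling_percentage: 2  # keep ~2% of everything else

A trace is kept if any policy matches, so error traces always survive and the rest get sampled down.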


You're nearly there: tail sampling on non-OK states.

https://opentelemetry.io/docs/concepts/sampling/#tail-sampli...



