
> Most log messages are useless 99.99% of the time. Best likely outcome is that it's turned into a metric. The once-in-a-blue-moon outcome is that it tells you what went wrong when something crashed.

If it crashes, it's probably some scenario that was not properly handled. If it's not properly handled, it's also likely not properly logged. That's why you need verbose logs: once in a blue moon you need the ability to retrospectively investigate something that was not thought through at the time, without using a time machine.

This is more common in the financial world, where an audit trail is required to be kept long-term for regulatory reasons. Some auditor may ask you for proof that you ran a unit test for a function three years ago.

Every organization needs to find its own balance between storage cost and quality of observability. I prefer to keep as much data as we can financially afford. If Binance is happy to pay to store 100PB of logs, good for them!

"Do we absolutely need this data or not" is a very tough question. Instead, I usually ask "how long do we need to keep this data" and apply proper retention policy. That's a much easier question to answer for everyone.




It is quite unlikely that a regulator will ask you for proof that you have a unit test for anything (also, that's not what a unit test is; see [1] for a good summary of why).

It _is_ likely that a regulator will ask you to prove you are developing within the quality assurance framework you have claimed to follow, though.

Finally, though, logs are not an audit trail, and almost no one can prove their logs are correct with respect to the state of the system at any given time.

[1]: https://www.youtube.com/watch?v=EZ05e7EMOLM


> If it's not properly handled, it's also likely not properly logged

Then your blue-moon probability of it being useful drops rapidly. Verbose logs are simply a pain in the arse unless you have a massive processing system, and even then they either kneecap your observation window or make your queries take ages.

I am lucky enough to work at a place that has really ace logging capability, but, and I cannot stress this enough, it is colossally expensive. Literal billions.

But logging is not an audit trail. Even here, where we have fancy PII shields and the like, logging doesn't have the SLA to record anything critical. If there is a capacity crunch, logging resolution gets turned down. Plus, logging anything of value to the system gets you a significant bollocking.

If you need something you can hand to a government investigator and you're pulling logs, you're already in deep shit. An audit framework needs a very high SLA, incredible durability, and strong authentication for both people and services. All three of those are generally foreign to logging systems.

Logging is useful and you should log things, but you should not use it as a way to generate metrics. Verbose logs are just a really efficient way to burn through your infrastructure budget.


> Verbose logs are simply a pain in the arse unless you have a massive processing system, and even then they either kneecap your observation window or make your queries take ages.

Which is why this blog post brags about their capability. Technology advances, and something difficult to do today may not be as difficult tomorrow. If your logging infra is overwhelmed, by all means drop some data and protect the system. But if Binance is happily storing and querying their 100PB of logs now, that's their choice, and it's totally fine; I won't say they are doing anything wrong. Again, we are talking about blue-moon scenarios here, which is all about hedging risks and uncertainties. It's fine if Netflix drops a few frames of a movie, but my bank can't drop my transaction.


How about only saving the verbose logs if there's an error?


Yup, nice idea: keep collecting logs in a flow and only emit them when there is an error. Or:

Start logging into a buffer and only flush it when there is an error.
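
A minimal sketch of the buffered variant, using Python's stdlib logging.handlers.MemoryHandler, which holds records in memory and flushes them to a target handler when a record at or above flushLevel arrives (or when the buffer fills); the capacity and handler choice are illustrative:

    import logging
    import logging.handlers

    # Final destination for flushed records (could equally be a file or network handler).
    target = logging.StreamHandler()
    target.setFormatter(logging.Formatter("%(asctime)s %(levelname)s %(message)s"))

    # Hold up to 10,000 records in memory; dump them all to `target`
    # only when something at ERROR level or above is logged.
    buffered = logging.handlers.MemoryHandler(
        capacity=10_000,
        flushLevel=logging.ERROR,
        target=target,
    )

    logger = logging.getLogger("app")
    logger.setLevel(logging.DEBUG)
    logger.addHandler(buffered)

    logger.debug("held in memory, not written anywhere yet")
    logger.error("boom")  # flushes the buffered DEBUG record along with this one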


I think this works well if you think about sampling traces, not logs.

Basically, every log message should be attached to a trace. Then you might choose to throw away trace data based on criteria, e.g. throw away 98% of "successful" traces and 0% of "error" traces.

The (admittedly not particularly hard) challenge then is building the infra that knows how to essentially make one buffer per trace, and keep/discard collections of related logs as required.
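
A rough sketch of that, with hypothetical names (TraceBufferingSampler, ship) just to show the shape of the keep/discard decision:

    import random
    from collections import defaultdict

    def ship(lines):
        # Hypothetical: forward a kept trace's logs to long-term storage.
        for line in lines:
            print(line)

    class TraceBufferingSampler:
        """Buffer log lines per trace; decide keep/discard when the trace ends."""

        def __init__(self, ok_keep_rate=0.02):
            self.ok_keep_rate = ok_keep_rate   # keep ~2% of successful traces
            self.buffers = defaultdict(list)   # trace_id -> buffered log lines

        def log(self, trace_id, line):
            self.buffers[trace_id].append(line)

        def end_trace(self, trace_id, had_error):
            lines = self.buffers.pop(trace_id, [])
            # Keep 100% of error traces and a small random sample of the rest.
            if had_error or random.random() < self.ok_keep_rate:
                ship(lines)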


It sounds nice, but also consider: 1) depending on how your app crashes, are you sure the buffer will be flushed, and 2) if logging is expensive from a performance perspective, your base performance profile may be operating under the assumption that you’re humming along not logging anything. Some errors may beget more errors and have a snowball effect.
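
On point 1, one partial mitigation (a sketch, not a guarantee) is to hook the interpreter's crash path so buffers are flushed on the way down; it covers unhandled exceptions but not SIGKILL, OOM kills, or hard crashes:

    import logging
    import sys

    logger = logging.getLogger("app")

    def flush_on_crash(exc_type, exc_value, tb):
        # Logging at ERROR trips a MemoryHandler's flushLevel (as in the sketch
        # above), so buffered context gets written out before the process exits.
        logger.error("unhandled exception", exc_info=(exc_type, exc_value, tb))
        for handler in logger.handlers:
            handler.flush()
        sys.__excepthook__(exc_type, exc_value, tb)

    # Only fires for unhandled exceptions in the main thread.
    sys.excepthook = flush_on_crash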


Both are solved by having a sidecar (think of it as a local ingestion point) that records everything (no waiting for a flush on error) and then does tail sampling on the spans whose status is non-OK, i.e. everything that's non-OK gets sent to Datadog, Baselime, your Grafana setup, or your custom ClickHouse 100PB storage nodes. Or take your pick of any of 1000+ OpenTelemetry-compatible providers. https://opentelemetry.io/docs/concepts/sampling/#tail-sampli...

Pattern is the ~same.
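
For reference, a sketch of what that policy looks like in an OpenTelemetry Collector config, assuming the contrib tail_sampling processor; the policy names and the 2% figure are illustrative:

    processors:
      tail_sampling:
        decision_wait: 10s            # how long to wait per trace before deciding
        policies:
          - name: keep-all-errors
            type: status_code
            status_code:
              status_codes: [ERROR]
          - name: sample-ok-traffic
            type: probabilistic
            probabilistic:
              sampling_percentage: 2  # keep ~2% of everything else

A trace is kept if any policy matches, so error traces always survive and the rest get sampled down.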


You're nearly there: tail sampling on non-OK states.

https://opentelemetry.io/docs/concepts/sampling/#tail-sampli...



