How about only saving the verbose logs if there’s an error?

chhabraamit · 2024-07-11T20:02:34.000000Z

Yup, nice idea. Keep collecting logs in a flow and only write them out when there is an error. Or

start logging to a buffer and only flush it when there is an error.
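
A minimal Python sketch of the buffer-and-flush idea, assuming the standard `logging` module. `FlushOnErrorHandler` and its parameters are hypothetical names made up for illustration. The stdlib `logging.handlers.MemoryHandler` is close, but it also flushes when its buffer fills up, so this sketch uses a ring buffer that silently drops the oldest records instead.

```python
import logging
from collections import deque


class FlushOnErrorHandler(logging.Handler):
    """Hold recent records in memory; forward them only when an error arrives."""

    def __init__(self, target, capacity=1000, flush_level=logging.ERROR):
        super().__init__()
        self.target = target            # handler that does the real (expensive) output
        self.flush_level = flush_level  # records at or above this level trigger a flush
        self.buffer = deque(maxlen=capacity)  # oldest records fall off when full

    def emit(self, record):
        self.buffer.append(record)
        if record.levelno >= self.flush_level:
            self.flush_buffer()

    def flush_buffer(self):
        # Forward everything buffered so far, oldest first, then start empty.
        while self.buffer:
            self.target.handle(self.buffer.popleft())


if __name__ == "__main__":
    log = logging.getLogger("demo")
    log.setLevel(logging.DEBUG)
    log.addHandler(FlushOnErrorHandler(logging.StreamHandler(), capacity=100))
    log.debug("loading config")   # buffered, nothing printed yet
    log.debug("connecting")       # buffered
    log.error("connection lost")  # prints the two debug lines, then the error
```

If no error ever happens, the verbose records are simply overwritten and never leave the process.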

growse · 2024-07-13T19:14:48.000000Z

I think this works well if you think about sampling traces not logs.

Basically, every log message should be attached to a trace. Then, you might choose to throw away the trace data based on criteria, e.g. throw away 98% of "successful" traces, and 0% of "error" traces.

The (admittedly not particularly hard) challenge then is building the infra that knows how to essentially make one buffer per trace, and keep/discard collections of related logs as required.
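
One way to sketch the "one buffer per trace" infra in Python. `TraceBuffer`, `sink`, and `keep_ok_ratio` are hypothetical names, and the 2% default just mirrors the "throw away 98%" figure above. It assumes something upstream tells you when a trace has finished and whether it errored.

```python
import random
from collections import defaultdict


class TraceBuffer:
    """Collect log messages per trace; decide keep/discard when the trace ends."""

    def __init__(self, sink, keep_ok_ratio=0.02, rng=None):
        self.sink = sink                    # callable(trace_id, messages) for kept traces
        self.keep_ok_ratio = keep_ok_ratio  # fraction of successful traces to keep
        self.rng = rng or random.Random()
        self._buffers = defaultdict(list)   # trace_id -> list of messages

    def log(self, trace_id, message):
        self._buffers[trace_id].append(message)

    def finish(self, trace_id, error):
        """End a trace. Error traces are always kept; OK traces are sampled."""
        messages = self._buffers.pop(trace_id, [])
        keep = error or self.rng.random() < self.keep_ok_ratio
        if keep:
            self.sink(trace_id, messages)
        return keep


if __name__ == "__main__":
    buf = TraceBuffer(sink=lambda tid, msgs: print(tid, msgs))
    buf.log("t1", "request received")
    buf.log("t1", "db timeout")
    buf.finish("t1", error=True)   # always emitted
    buf.log("t2", "request received")
    buf.finish("t2", error=False)  # emitted only about 2% of the time
```

A real version would also need a timeout for traces that never finish, otherwise the buffers grow without bound.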

mnutt · 2024-07-11T23:52:05.000000Z

It sounds nice, but also consider: 1) depending on how your app crashes, are you sure the buffer will be flushed, and 2) if logging is expensive from a performance perspective, your base performance profile may be operating under the assumption that you’re humming along not logging anything. Some errors may beget more errors and have a snowball effect.
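
To make point 1 concrete, here is a rough Python sketch that covers only one crash mode: an uncaught exception in the main thread. It assumes the logger has a stdlib `logging.handlers.MemoryHandler` attached with its default flush level of ERROR; `install_crash_flush` is a hypothetical helper name. A SIGKILL, a segfault, or an OOM kill never reaches this hook, so the buffer is lost in those cases.

```python
import logging
import logging.handlers
import sys


def install_crash_flush(logger):
    """Log uncaught exceptions at CRITICAL so a buffering handler flushes."""
    previous_hook = sys.excepthook

    def hook(exc_type, exc, tb):
        # CRITICAL is above MemoryHandler's default flushLevel (ERROR),
        # so this record pushes out everything buffered before the crash.
        logger.critical("unhandled exception", exc_info=(exc_type, exc, tb))
        previous_hook(exc_type, exc, tb)

    sys.excepthook = hook
    return hook


if __name__ == "__main__":
    log = logging.getLogger("demo")
    log.setLevel(logging.DEBUG)
    log.addHandler(
        logging.handlers.MemoryHandler(capacity=1000, target=logging.StreamHandler())
    )
    install_crash_flush(log)
    log.debug("about to do something risky")  # buffered
    raise RuntimeError("boom")                # hook logs it and flushes the buffer
```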

yepitsthat · 2024-07-12T06:56:31.000000Z

Both are solved by having a sidecar (think of it as a local ingestion point) that records everything (no waiting for a flush on error), and then does tail sampling on the spans where the status is non-OK - i.e. everything that's non-OK gets sent to Datadog, Baselime, your Grafana setup, your custom Clickhouse 100PB storage nodes. Or take your pick of any of 1000+ OpenTelemetry compatible providers. https://opentelemetry.io/docs/concepts/sampling/#tail-sampli...

Pattern is the ~same.

yepitsthat · 2024-07-12T06:54:59.000000Z

You're nearly there. Tail sampling on non-OK states.

https://opentelemetry.io/docs/concepts/sampling/#tail-sampli...