Discussion: structured, leveled logging

reggieband · on Sept 11, 2022

I see some asking for stack support in the logs, and that is something I would be looking for. Often when an error happens on a server it might be inside some generic utility. At least with a stack, you might realize that all of the calls to that utility function are coming from the same location a few steps up the stack. In my experience that can significantly speed up identifying the root cause of issues.

I'm not sure I like the way they are handling attributes either. One pattern I strongly discourage on projects I work on is formatting data into the log message. When I've had access to decent log aggregation tools (e.g. Splunk, Datadog) I like to generate reports on error messages. That is usually possible with regex magic but in my experience it is much much cleaner to have a static unchanging message for all messages. I then store all additional details in a separate part of the log line. I tend to prefer JSON for those details since it is well supported and even parsed by default by many tools. Another benefit of JSON for those additional details is that it is supports arrays and hash-tables. The attribute system proposed here doesn't seem to have that same hierarchical attribute support.

IMO, unified, clean and consistent logging is one of the most underrated techniques for maintaining long running services.

suralind · on Sept 12, 2022

This is what annoys me in my current job - every service uses unstructured logging. We use a couple popular loggers in Go and still, people put values in messages rather than just make the message static and put all variables into, well, variables.

XorNot · on Sept 12, 2022

These days I kind of wish we didn't do text logging at all. There's essentially two types of logs: fixed messages, and values.

A binary logging protocol can make this definition explicit and more efficient: you're either emitting a token for the fixed message in your applications format, or you're emitting values...and a fixed token for the format string, which can be parsed and reconstructed later.

Our applications don't need to be doing this sort of text formatting in them at all, and we need the developer interface to make explicit what's happening and what's important.

Waterluvian · on Sept 12, 2022

All I ever want to know is what line of code in MY code is in the stack trace. Maybe I’m doing it wrong but this is a regular pain point for me.

Sometimes I’ll even get Python or JS traces that don’t even list a single line of my own code.

noisy_boy · on Sept 12, 2022

I think a lot of people want what you mentioned. But I also think that even if stack support is implemented, they will "keep it simple" and pass the responsibility of extracting the location/details of your code from the stack trace to, you. That will give rise to everyone writing there own variety of utility/common stack parsing methods and then libraries will spring-up that provide such conveniences and it'll go on to become its own thing.

I feel like that is Go's underlying theme: "we don't provide last-mile-connectivity; how else will the ecosystem grow?!"

jayd16 · on Sept 12, 2022

Isn't letting the log library do the formatting a common practice? That way you can get formatted logs and the library can cluster logs by the format string instead of the final output.

reggieband · on Sept 12, 2022

I don't care about the formatting, I care about the available data types. JSON is just easy to use since it includes arrays/hash-tables and it is already supported by a large number of tools.

If you look at the Attr type definition you can see they support many primitive types including Int, Float, Bool, String, Duration, and Time. So if I had a record type, like a customer I could do:

    slog.LogAttrs(slog.ErrorLevel, "oops",
        slog.Int("customer_id", 123456), 
        slog.Float("customer_balance", 12.42)
        slog.Time("customer_updated_at", ...))

But I would prefer a structured data type there. Something more like:

    { "customer": {
        "id": 123456,
        "balance": 12.42
        "updated_at": ...
        }
    }

In fact, I'm not sure how I'd go about supporting arrays as Attr except with keys like "item_0", "item_1", etc. Or maybe serialize it into a representation like comma separated values. But now I'm coupling my log to a custom serialization/deserialization - I'd rather just use JSON once again since most tools will know how to handle it.

glenjamin · on Sept 11, 2022

I’m increasingly of the opinion that we should stop using “logging” libraries, and instead we should produce telemetry via events using something like OpenTelemetry’s trace API.

These traces can be rendered as strings for people who like their logs in files

chrsig · on Sept 11, 2022

I'm inclined to agree.

Traditional logging has a number of problems. Developers tend to just stick `log.Info` (or whatever equivalent) anywhere.

Using an http server as an example:

The result is that one http request may produce dozens of log messages. But because multiple http requests are running at the same time, they all get interleaved in the actual log.

They can be tied together after the fact if you're using a request id, but that requires post processing. It can be difficult to know when a request ends, so that means there needs to be an explicit "end of request" type message, otherwise assume that it could still be active until the end of the file...unless your log is rotating, in which case it could be in the next file, or the next.

On top of that, it's a big performance hit. On every log message, you're serializing the message, acquiring a mutex, flushing it out to disk. Depending on the library, some of that may or may not get evaluated depending on message severity. It's a non-trivial overhead, and individual messages can cause quite a bit of contention for that mutex.

Using something more like tracing, a log message gets attached to a context for the task, and it's not until the task is complete does anything get logged. This makes for a single log message that has a list of all the individual informational messages, as well as other metadata about the request. One mutex acquisition, one serialization, one file flush.

It's at that point where tracing and logging are arguably the same thing.

xani_ · on Sept 11, 2022

> Using something more like tracing, a log message gets attached to a context for the task, and it's not until the task is complete does anything get logged. This makes for a single log message that has a list of all the individual informational messages, as well as other metadata about the request. One mutex acquisition, one serialization, one file flush.

That's kinda terrible design. You don't see anything in progress, only if it finishes, and anything in progress just disappears, never to be seen again, if app crashes.

Splitting to event per action in a given context is a feature, not a bug

jiggawatts · on Sept 12, 2022

Not necessarily, this is an implementation detail.

Azure Application Insights for example streams events to a log file asynchronously. A crash is typically captured as a memory snapshot crash dump anyway, so if the entire process terminates then you have something that's probably superior to a text log!

If you open the crash dump, you can inspect the in-memory buffer of log events directly in a debugger.

If the operation fails internally with an exception, then the logs are captured as normal including the exception details.

Asynchronous logging is dramatically more efficient. Or to put it another way, synchronous logging often causes performance bottlenecks that surprises developers.

If you do want "live" logs equivalent to viewing the tail of a text file, most systems (such as App Insights mentioned above) can dynamically switch from async to sync mode and stream all events second-by-second.

wokkel · on Sept 11, 2022

Just look into log4j: mdc/ndc to include context (user/session), include the thread name in your pattern and use an async appender to take care of performance. All of these things have been solved already.

wsargent · on Sept 12, 2022

MDC/NDC only reliably works when you don't have asynchronous code (Akka/Vert.x/CompletionStage) through your application. As soon as you start using multiple threads, it becomes significantly more complex to keep MDC context carried over from one thread to the next.

kaba0 · on Sept 12, 2022

Well, that has more to do with reactive programming not playing by the rules. Once Loom arrives with Extent-Local scopes it will solve it very well.

chrsig · on Sept 11, 2022

> All of these things have been solved already.

I mean...yes, that's how I was able to highlight the benefits of having a single log message per request.

> Just look into log4j: mdc/ndc to include context (user/session), include the thread name in your pattern and use an async appender to take care of performance.

At a glance this looks like it will still produce 1 log line per logging call, versus buffering with the request. adding the thread id just helps stitch them together after the fact. This doesn't get rid of any performance overhead -- it may get it out of the way of a user request, but it's still there under the hood.

In the go case, using the runtime/trace library as an example[0], a `context.Context` is passed down the call stack which gets decorated with trace spans/regions/log messages on the way down. At the end of a task, all of the collected context gets written to the active trace log.

[0] https://pkg.go.dev/runtime/trace

wokkel · on Sept 11, 2022

I like my log to include where something happened and in some complex cases having a few thousand lines of logging info per request is not unreasonable. So stuffing al that info into a single log event/line doesn't make sense to me. I recently had the unfortunate situation where two seemingly independant sessions were interfering, here a sequential logfile where things were logged in the order occurring was the only way to correlate them. Even though we do telemetry via events, old school logfiles are a godsend to debug the complex cases.

nine_k · on Sept 11, 2022

Logging to syslog already solves the problem of asynchronous / streamed logging.

Structuring log messages better than an arbitrary text string is harder (even though I heard the widespread systemd offers a way). Mostly it's because it's hard to agree upon a universal standard. Logging in a self-describing structure is basically the same as logging in JSON (which is not unheard of). Maybe you could do a bit better logging in protobuf or thrift records.

chrsig · on Sept 11, 2022

Syslog doesn't solve any of the problems outlined above - you're just trading a file for a unix domain socket. And it introduces it's own problems, such as sharing that socket with the containers (if you just mount /dev/log into the container, you're gonna have a bad time when the syslog daemon restarts). Log rotation also becomes a problem at higher volumes. Want a 5 minute rotation? logrotate isn't of much help.

re: logging in protobuf/thrift. Thrift is more well suited towards it due to it's stream friendliness. With protobufs, you can do better performance wise than json, but does tend to be a bit more annoying because you need to write a length header between messages. (at this point, I've considered logging directly into a `.tar` file with one file per request. I haven't ever done it, and wouldn't in production, but it's an intriguing thought to me.) With either format, you have to deal with the fact that they're now binary logs, and you lose out on the ability to use a number of tools like grep & awk, or you have some program to cat the log as plain text.

Logging to json, while it's a bit more overhead, offers access to a lot of tooling which is lost with binary logging or unstructured logging. Athena, snowflake, sqlite, kafka/ksql, jq, it's incredibly ELK friendly...The list goes on. In general, it's usable without the message definitions, so it can be much easier to use.

All of these tradeoffs wildly depend on what it is that you're logging and what your needs are.

aprdm · on Sept 12, 2022

You can have syslog emit json

r3trohack3r · on Sept 11, 2022

> Mostly it's because it's hard to agree upon a universal standard

This is one of the things I’ve appreciated in my corner of the Nodejs ecosystem. It’s fairly standardized around pino and bunyan. With async hooks you can instrument all downstream async APIs too and associate them with the incoming request. So, by default, in my services all logs are associated with their incoming requests.

That being said - I’m inclined to agree that, as the distributed tracing ecosystem becomes increasingly mature, it really is eating the use case of logs. The real trick with tracing is being able to associate a spike in latency for 5xx requests in metrics with all their associated traces. With logs I can aggregate all 5xx logs, filter by latency, and get the subset of logs for the high latency requests to dig in. In more mature stacks, I can even conditionally increase log sampling on-demand for those requests to get a better view into the problem (especially helpful for rare events that are unlikely to make it past the sampler).

I’m not good enough with traces yet to do this. Today I have to start with logs to find the request IDs to get to the traces.

dfee · on Sept 11, 2022

My God. Logging in protobuf?

Logging is the lowest of all debugging utilities - its the first thing you ever do writing software - “hello world”. And, while I admire structural logging, the truth is printing strings remains (truly) the lowest common denominator across software developers.

I often see confusion on how to interpolate variables - let alone zero standardization on field formatting, and poorly managed logging infra, too.

Adding another barrier to entry (protobufs) isn’t going to drive better usage. Any solution I could prescribe would inevitably chop off the long tail of SWE (raise the common denominator), and I think that’s going to be quite an unpopular position to advance in any established company.

To be clear: structure is good, our expectations of how to structurally log are too low, and introducing a build step to compile PB to native, and then dealing with a poorly generated (my opinion) API to log, sounds like a miserable experience.

arinlen · on Sept 11, 2022

> My God. Logging in protobuf?

Yes, or any other data format and/or transport protocol.

I'm surprised this is up for debate.

> Logging is the lowest of all debugging utilities - its the first thing you ever do writing software - “hello world”. And, while I admire structural logging, the truth is printing strings remains (truly) the lowest common denominator across software developers.

This sort of comment is terribly miopic. You can have a logging API, and then configure your logging to transport the events anywhere, any way. This is a terribly basic feature and requirement, and one that comes out of the box with some systems. Check how SLF4J[1] is pervasive in Java, and how any SLF4J implementation offers logging to stdout or a local file as a very specific and basic usecase.

It turns out that nowadays most developers write software that runs on many computers that aren't stashed over or under their desks, and thus they need efficient and convenient ways to check what's happening either in a node or in all deployments.

[1] https://www.slf4j.org/

wsargent · on Sept 12, 2022

> I'm surprised this is up for debate.

I looked into logging in protobuf when I was seeing if there was a better binary encoding for ring-buffer logging, along the same lines as nanolog:

https://tersesystems.com/blog/2020/11/26/queryable-logging-w...

What I found was that it's typically not the binary encoding vs string encoding that makes a difference. The biggest factors are "is there a predefined schema", "is there a precompiler that will generate code for this schema", and "what is the complexity of the output format". With that in mind, if you are dealing with chaotic semi-structured data, JSON is pretty good, and actually faster than some binary encodings:

https://github.com/eishay/jvm-serializers/wiki/Newer-Results...

dfee · on Sept 12, 2022

Yes, I have (and currently do) use slf4j. It’s not a panacea. Maybe you also write software that services millions of MAUs with engineering teams spread across the world - and seen immensely different results. That’d be a breath of fresh air.

However, as a stickler for quality, I find myself on the wrong end of this discussion in practice. I insist, logging is truly the LCD of software development. Introducing unergonomic data structures generated from protobufs strikes me as something that the long tail would ignore, and instead simply reach for something simpler to instead log `here`, `here1`, etc.

GeneralMayhem · on Sept 11, 2022

> But because multiple http requests are running at the same time, they all get interleaved in the actual log.

Many logging libraries will export a structured preamble including the timestamp and PID, which is enough to reconstruct single-threaded time histories when you want them, while still allowing you to see cross-thread time-correlated issues when you need that as well.

kitd · on Sept 12, 2022

Unfortunately async code often blows that idea up. Cf. Node apps where the log comes from some arbitrary callback with little to no indication of where the root cause occurred.

KronisLV · on Sept 11, 2022

> It's at that point where tracing and logging are arguably the same thing.

On some level I agree with this, but honestly tracing doesn't have nearly the same adoption that logging has (at least in my experience).

Why? Because generally it's easier to approach logging, even if in ways that don't necessarily scale that well: like getting your application writing logs to a file on the server directory.

Note: I mention Java here, but a lot of it also applies to Node/Python/Go/Ruby etc.

Eventually they can be moved over to a networked share, have Logrotate running against them, or perhaps have proper log shipping in place. That is perhaps the hardest aspect, since for tracing (outside of something like JavaMelody which is integrated inside of your app) you'll also need to set up some sort of a platform for the logs to be shipped to and processed in any number of ways.

Personally, I've found that something like Apache Skywalking is what many who want to look into it should consider, a setup utilizing which consists approximately of the following:

  - some sort of a database for it (ElasticSearch recommended, apparently PostgreSQL/MariaDB/MySQL viable)
  - the server application which will process data (can run as an OCI container)
  - the web application if you need an interface of this sort (can run as an OCI container)
  - an agent for your agent of choice, for example a set of .jar files for Java which can be setup with -javaagent
  - optionally, some JS for integrating with your web application (if that's what you're developing)

Technically, you can also use Skywalking for log aggregation, but personally the setup isn't as great and their log view UI is a bit awkward (e.g. it's not easy to preview all logs for a particular service/instance in a file-like view), see the demo: https://skywalking.apache.org/

For logs in particular, Graylog feels reasonably sane, since it has a similarly "manageable" amount of components, for a configuration example see: https://docs.graylog.org/docs/docker#settings

Contrast that to some of the more popular solutions out there, like Sentry, which gets way more complicated really quickly: https://github.com/getsentry/self-hosted/blob/master/docker-...

For most of the people who have to deal with self-hosted setups where you might benefit from something like tracing or log shipping, actually getting the platform up and running will be an uphill battle, especially if not everyone sees the value in setting something like this up, or setting aside enough resources for it. Sometimes people will be more okay with having no idea why a system goes down randomly, rather than administering something like this constantly and learning new approaches, instead of just rotating a bunch of files.

For others, there are no such worries, because they can open their wallets (without worrying about certain regulations and where their data can be stored, hopefully) and have some cloud provider give them a workable solution, so they just need to integrate their apps with some agent for shipping the information.

For others yet, throwing the requirement over to some other team who's supposed to provide such platform components for them is also a possibility.

phillipcarter · on Sept 11, 2022

> Why? Because generally it's easier to approach logging

That's due to a significant head start. When one practice has more than a decade to mature by spreading across an ecosystem and becoming more ergonomic, it's definitely going to be easier to adopt.

However, tracing as come a LONG way with OpenTelemetry even in the past year. There's still a big mountain to climb, but making tracing better/more widespread and providing on-ramps to metrics and logging systems is exciting.

forrestthewoods · on Sept 12, 2022

What’s the difference between logging and tracing?

phillipcarter · on Sept 12, 2022

Logs come in two forms: structured and unstructured. Throw your unstructured logs in the garbage bin if you can - they do very little to help systematically analyze systems.

Structured logs are logs that adhere to a schema and let you attach key:value pairs of additional metadata, so you can add context from your app to those logs. They're extremely helpful for systematically debugging systems, but they're not correlated with anything. With only logs, you need to divine what led to such a log yourself, or manually re-create tracing within your logging system.

Traces are basically structured logs but have correlation IDs that let a backend stitch them together. Tracing libraries also offer a lot more than just this, but it's the core of tracing. I like to think that tracing is just an objectively better form of structured logging.

tptacek · on Sept 11, 2022

The otel trace API is nowhere near ergonomic enough to replace logging. I think people mostly tolerate otel because the idiomatic way to use it is just a funcall and a defer at the top of a subset of your functions, and because if you're otel-instrumenting, you're probably already passing a context.

https://pkg.go.dev/go.opentelemetry.io/otel/trace

phillipcarter · on Sept 11, 2022

FWIW tracing in Go is primarily made more difficult by having explicit context propagation and no helper to auto-close a span based on scope. If you use Python or .NET for example, it's implicit and you can "just create some spans" like you would a structured log somewhere in your app, then use built-in language constructs to have it handle closing for you.

tptacek · on Sept 11, 2022

Sure. If it was a different language, or a different otel library, my opinion would probably be different. I do not think it's realistic to replace logging with otel in Go, though.

phillipcarter · on Sept 11, 2022

I think it's an inclusive OR here. When logging stabilizes in the Go SDK (which will be a while, perhaps a year) the idea will be that you can keep all your existing logging and then start to add tracing judiciously so they're easily correlated with requests. I think the juice is worth that particular squeeze for more folks at least.

From there I would personally advocate for adding more manual traces rather than more logs, but it's understandable if it's a heavier lift to pass context around all the time. The more options the better.

glenjamin · on Sept 11, 2022

For web-style application code my experience writing Go code says that you totally can stop using logs and just use traces and have just as much visibility into a running process, if not more.

For library code or other deployment models I’m not sure - it’s not something I’ve had a need to experiment with

Rapzid · on Sept 11, 2022

.Net has task-local storage which is pretty amazing and Seri log made great use of it(my all-time fav structured log lib).

But iirc task local is a perf concern, and JavaScript has yet to have anything comparable that's ubiquitous(nodejs async hooks though so one day maybe).

So, explicit contexts for Go and JavaScript unfortunately.

phillipcarter · on Sept 11, 2022

JS does have convenience APIs in the form of startActiveSpan and getActiveSpan that take care of most of the pains here. There's likely some scenarios where you can't use these (e.g., instrumenting a tool must use the sdk-trace-base package) but I think in most cases they help a lot. Go doesn't have these kinds of convenience APIs though.

xani_ · on Sept 11, 2022

...the helper like WithCancel/WithDeadline ?

enobrev · on Sept 12, 2022

What annoys me about otel is that it seems to require direct live connections. I'd much rather just add the tracking keys / spans / whatever to my logs and parse them out of there later, but no standard lib / app / process seems to like that idea. I have no intention of adding extra sockets to my requests when I can just dump everything to the logs as JSON.

So that's what I do, but instead of relying on otel tools, I have custom queries I run against elasticsearch (or grep syslog) using my trace tags. Works great. No multi-process visualizations but i can make my way through a process chain quickly enough

wdb · on Sept 12, 2022

That's why you have otel logging: https://github.com/open-telemetry/opentelemetry-specificatio...

henvic · on Sept 11, 2022

I've played with trying to create an idiomatic wrapper for their Go module precisely because of this reason, but then didn't find time to continue with this.

https://github.com/henvic/tel

arinlen · on Sept 11, 2022

> I’m increasingly of the opinion that we should stop using “logging” libraries, and instead we should produce telemetry via events using something like OpenTelemetry’s trace API.

That really depends on what you mean by "events". Even OpenTelemetry supports logging events. Logging is not replaced by emitting metrics, as they have very distinct responsibilities.

> These traces can be rendered as strings for people who like their logs in files

You're missing the whole point of logging stuff. The point of logging things is not to have stuff show up in files. Logging events are the epitome of observability, and metrics and tracing events end up specializations of logging events which only cover very precise and specific usecase.

glenjamin · on Sept 11, 2022

> Logging events are the epitome of observability, and metrics and tracing events end up specializations of logging events which only cover very precise and specific usecase.

This is a common way of thinking about things, but having now worked with event-based tracing for the last few years I no longer believe traditional logs are necessary.

If you do “good” logging you likely have entirely structured data and no dynamic portions of the main log message. This is almost exactly equivalent to a span as represented by tracing implementations.

I have come to believe that these events are the core piece of telemetry, and from them you can derive metrics, spans, traces and even normal-looking logs.

wsargent · on Sept 12, 2022

I see it the other way around: tracing is essentially logging with hierarchy. If you can keep the context of a "parent" span around, then you can log all of the entries out and build up the entire trace from spans, although you can get pretty confused if you don't write them out in the correct order.

However, if you don't have context, then the log entry is the atomic unit -- metrics are log entries with numbers that can be aggregated, spans are log entries with a trace id and a parent span, and "events" in Honeycomb terminology are logs with attributes covering an entire business operation / HTTP request.

arinlen · on Sept 11, 2022

> This is a common way of thinking about things, but having now worked with event-based tracing for the last few years I no longer believe traditional logs are necessary.

You are free to use stuff as you see fit, even if you intentionally miss out on features.

Logging is a very basic feature whose usecases are not covered by metrics events, or even traces. Logs are used to help developers troubleshoot problems by giving them an open format to output data that's relevant to monitor specific aspects of a system's behavior.

It makes no sense to emit a metrics event with tons of stuff as dimensions and/or annotations that's emitted in a single line of code that's executed by a single code path when all you need is a single log message.

Also, stack traces are emitted as logging events, and metrics events are pretty useless at tracking that sort of info.

But hey, to each its own. If you feel that emitting a counter or a timer gives you enough info to check why your code caught an exception and what made it crash, more power to you. The whole world seems to disagree, given standard logging frameworks even added support for logging stack traces as structured logs as a basic feature.

glenjamin · on Sept 11, 2022

A counter or a timer is clearly no substitute for a log, that would be an absurd statement for me to have made - which is why I said no such thing.

A span as modelled by OpenTelemetry is a a bag of developer-chosen structured data containing a name, a timestamp and some auto-generated IDs to correlate it with others.

It is effectively an open standard for how to collect good structured logs - although it is currently not commonly pitched that way by either tracing advocates or structured logging advocates.

These are commonly used to create trace waterfalls, but can also be used to derive metrics or simple dev-debug output.

I have used this approach myself to great effect, and also spoken to a number of people who have also found success with such an approach - but currently the momentum of existing approaches (and the fact that traditional logging still works well-enough for most people) has not led to wide adoption

rcarmo · on Sept 11, 2022

Many tracing libraries get their data off logging libraries. Logging doesn't always mean "whoa, let's stop everything and write to a file", it is often a matter of buffering data or increasing counters until data can be sent out in various fashions.

Log files are just one of those.

glenjamin · on Sept 11, 2022

The key difference is my experience is not about destination - backends are as flexible as you say

One key difference is having an event name + properties, rather than a human-readable description and maybe some fields if you’re lucky

Another key difference is wrapping units of work rather than just scattering log lines in places you think need them somewhat arbitrarily

jayd16 · on Sept 11, 2022

>One key difference is having an event name + properties [...] Another key difference is wrapping units of work [...]

Modern "enterprise" logging libraries already work this way, don't they?

rcarmo · on Sept 11, 2022

Most decent logging frameworks allow for structured output. That change came about roughly when NewRelic became "the thing" to use.

markmark · on Sept 11, 2022

You probably need to look into modern logging if you think that's how logging still works.

jayd16 · on Sept 11, 2022

I guess Golang is catching up but in something like C# or Java, the logging libraries already feel like tracing libraries. They already have contexts and structure.

What are you proposing should change? Log files should default to a verbose binary representation that captures more? What do you think is not captured by today's logging?

phillipcarter · on Sept 11, 2022

Tracing gives you correlation between "log lines" and across process/network boundaries by propagating context for you. You could replicate everything that tracing does by manually propagating around context-like metadata yourself, stitching the log lines together into a "trace". But why not just use an SDK that does that for you?

jbjbjbjb · on Sept 11, 2022

It is a config away in C# world, and even when it wasn’t so in built it was very simple to add to structured logging.

phillipcarter · on Sept 11, 2022

In the C# world, tracing is also just a config away. It's baked into the runtime. But you don't need to manually stitch together logs into a "trace" - it just does it for you.

jbjbjbjb · on Sept 11, 2022

What I meant was you can easily add a correlation id to your logs, and send those to wherever you collect and query the logs. Then using that you can easily see the log across your services. So there isn’t any manual stitching in the logging either.

phillipcarter · on Sept 11, 2022

You can! And then if you want to establish causality from log to log, representing which part of a system calls which other part, you can add another id that lets you know which log is a “parent” of another. And maybe later, you may need to propagate metadata across a request, so you build that system and use it to better enrich your logs. And maybe later, you’ll need to correlate subsets of logs with other subsets of logs, so you build a way to “link” related sets of logs together. And so on, and so on. All of this is possible with “just logs” today and if it’s your jam, go for it. But these are reasons why tracing libraries exist.

akira2501 · on Sept 11, 2022

I'm of the opinion that we should use both. They're both fundamentally useful and solve separate problems, and I don't think there's any resource constraints that would make this an issue.

A logging API that just lets me write either a textual string, a structured telemetry object, or just both would be ideal. Then I can have either source separately, or I can view how they generated events collectively.

phillipcarter · on Sept 11, 2022

OpenTelemetry's design says: why not both! Logging is still early days, but the idea is that you can take your existing structured logs and have them automatically correlated with traces that you can add later.

(there's other use cases some can envision, but I'm team "tracing is better than logging", while recognizing that most apps don't start from scratch and need to bring their logs along for the ride)

jen20 · on Sept 11, 2022

I'm in the position of working on a green-field system currently, and spent a bit of time looking at OpenTelemetry both as a logging system and for metrics and tracing. If OpenTelemetry can indeed suit that use case today, it's not evident from any of the documentation or examples, and the naive "go get" approach in a test project led to incompatible versions of unstable libraries being pulled in.

Further, the OpenTelemetry Go library pulls in all kind of dependencies which are not acceptable: a YAML parser which doesn't even use compatible tagging and versioning via Testify, logr and friends, and go-cmp. Combined with the insistence of GRPC on pulling glog and a bunch of other random stuff like appengine libraries, you end up with a huge dependency tree before you've even started writing code.

The result is I'm sticking with zerolog, and a custom metrics library, and will let Splunk do it's thing. It's a shame, as distributed tracing is very desirable, but not desirable enough to ignore libraries that have chosen concrete dependencies vs interfaces and adapters.

Fundamental things like this need to be part of the standard library, or have zero external dependencies. Rust does (in my opinion) a better job here despite requiring on average more libraries to achieve things, because features can be switched off via Cargo.toml, reducing the overall dependency set.

phillipcarter · on Sept 11, 2022

(note: I'm not a maintainer in the Go SIG so I can't speak towards these issues with much authority)

Definitely interested in how things can be improved here. What are you generally looking to accomplish? Hopefully it's just a matter of docs not accurately expressing what's possible today (a consistent theme for quite some time that's getting resolved, but it takes time).

As for the dependency issues, is this something you'd have the time to file an issue about[0], or make an agenda item for the next Go SIG? At the end of each SIG there's also a moment where the SIG leader asks for stories about adoption (good or bad) and this would clearly count. I'd like to think that these issues are solvable, or at least solvable enough to make it worth it for you to adopt.

[0]: https://github.com/open-telemetry/opentelemetry-go/issues?q=...

jen20 · on Sept 11, 2022

Thanks for the thoughtful response.

I think the aim is to have a single "sink" for metrics, logs and tracing, and it's clear that OpenTelemetry is aiming to be that. Ideally from a single "go get", one should be able to pull in a library without additional dependencies that:

- permits recording of structured logs (perhaps the stdlib interface under discussion will help here!). From a personal perspective I'd rather not have an API that accepts key-value pairs as varargs, but if that was the only thing available I'd take it I think.

- permits recording of counters, gauges, timers and histograms,

- permits recording of spans, and associating both logs and metrics with those spans.

The Go library README shows the logs section as "frozen" and states that no pull requests to improve logging are being accepted. With no visibility into the process (which is, to be clear, because I haven't looked into it and not the fault of the project per se!), it looks like this might be an option in future but not today?

phillipcarter · on Sept 11, 2022

I see! Yeah, this is one where where otel-go is a lot harder to use, but it's something the SIG is looking at. A coworker of mine is helping drive a design that's sort of an "easy button" to configure all the things with the least-surprising defaults[0] and we're seeing how people like it in our SDK distribution that uses it[1]. I hope that sometime soon we'll have the design polished-up enough to get merged in. Like most OSS projects, it'll take some time but I'm confident we can get it done.

The main challenge is that there's a large variety of use cases to fulfill (e.g., someone wants custom context propagation, a custom span processor, and export over HTTP+json but not HTTP+protobuf) and today the answer to that is that you have to pull in all the libraries for all the things you need. It's a lot more energy you need to expend to get started with all of this than it needs to be.

As for logging support in the Go SDK, it's frozen mostly just due to lack of bandwidth and a need to finish what's already been started. Metrics have proven to be much more difficult and time-consuming to implement correctly across all languages, with Go being impacted harder than other languages (e.g., Python and .NET). I think you can expect logging integrations in the near-ish future though.

This is great feedback. I'll pass it on folks who haven't seen it. Thank you! And please feel free to file issues about all the things that rub you the wrong way

[0]: https://github.com/open-telemetry/opentelemetry-go-contrib/p...

[1]: https://github.com/honeycombio/honeycomb-opentelemetry-go

magicalhippo · on Sept 12, 2022

> OpenTelemetry's design says: why not both!

Been spending 15-20 minutes on the OpenTelemetry site trying to figure out "what then?"

Seems the only exporters that's supports both are... log files? So we're back to square one? What am I missing?

phillipcarter · on Sept 12, 2022

.NET, Python, and Java have OTLP exporters for both today. Logs are still mostly alpha/experimental today, but stable in the data specification, so you can start using them together today.

jyounker · on Sept 11, 2022

Or you include the trace ID as one field in the structured log. That ID can be threaded through the log library.

cube2222 · on Sept 11, 2022

I agree.

I'm not necessarily a fan of OpenTelemetry, as every time I looked at it it was still "in progress" (though the last time was almost 2 years ago), but in general I agree.

We started with tracing+logging, but as time passed we moved more and more stuff into tracing. But not just span-per-service kind of tracing - detailed orchestration on the level of "interesting" functions, with important info added as tags.

Right now we use almost exclusively tracing (and metrics ofc) and it's working great, even if each trace often has hundreds of spans.

In our case we use Datadog, which works well for most bits and purposes, but we also send all traces over Firehose to S3 so that we can query them with Athena at 100% retention. This way you get your cake and can eat it too.

phillipcarter · on Sept 11, 2022

If you've got the time for instrumenting something relatively small, it's definitely worth picking up OTel again to see how it works for you. Tracing is stable in most languages (Go included) and there's a lot more support in each language's ecosystem.

FWIW I think OTel will always be in a state of "in progress" though. There's more scenarios and signal types that will get picked up (profiling, RUM) and more things to spread across the ecosystem (k8s operator improvements, collector processors for all kinds of things, etc.) and more integration into relevant tools and frameworks (.NET does this today, maybe flask or spring could do it tomorrow). But I wouldn't let that stop you from trying it out again to see if you can get some value out of it!

spirit-sparrow · on Sept 11, 2022

Log the things that require a human to fix something every time they happen (Include useful info that’s needed for fixing). Create metrics of the rest: things that show patterns. Trace a percentile of actions that span over multiple components.

I believe all the three: logging, metrics, tracing have their own uses, one can’t replace the other.

azlev · on Sept 11, 2022

Telemetry and logs are distinct. It makes sense to gather metrics and it makes sense to catch sporadic messages of what's going on in details.

javier2 · on Sept 12, 2022

I dont disagree, but who the hell has time to setup and maintain the OpenTelemetry instrumentation and its myriad of other components.

phillipcarter · on Sept 12, 2022

Would love to learn which areas of OTel you find are onerous to set up and maintain. Experiences can differ from language to language which is why I ask. For example, with JS/Node there's a convenient SDK package and autoinstrumentation metapackage, but with Go you need to install a lot more stuff to get an equivalent experience.

javier2 · on Sept 20, 2022

We mainly use go and java, with different dbs and way too many http clients. But the instrumentation is even just a small part of it, we actually had most services fully instrumented with OpenTracing, but then they decided to abandon OpenTracing for OTel... And even then, the more difficult thing is running the trace ingestion without it crashing or running out of resources all the time, and balance that with just having everything disabled. Our most stable setup was just running with the in memory store so we at least have a trace for a few hours when we need it, but we essentially just gave up.

jahewson · on Sept 11, 2022

Nice idea but I don’t want to have to take the hit of tracing every request.

motiejus · on Sept 11, 2022

If we ignore the proposal, this gives at least an awesome overview of the state-of-the-art of logging in go. That was the best part. Thanks for the heads up!

egnehots · on Sept 11, 2022

Prior Work

- The existing log package has been in the standard library since the release of Go 1 in March 2012. It provides formatted logging, but not structured logging or levels.

- Logrus, one of the first structured logging packages, showed how an API could add structure while preserving the formatted printing of the log package. It uses maps to hold key-value pairs, which is relatively inefficient.

- Zap grew out of Uber's frustration with the slow log times of their high-performance servers. It showed how a logger that avoided allocations could be very fast.

- zerolog reduced allocations even further, but at the cost of reducing the flexibility of the logging backend.

- All the above loggers include named levels along with key-value pairs. Logr and Google's own glog use integer verbosities instead of named levels, providing a more fine-grained approach to filtering high-detail logs.

Other popular logging packages are Go-kit's log, HashiCorp's hclog, and klog.

thechao · on Sept 12, 2022

All the sqlite love, here, recently... but none on logging? Using a `:memory:` SQLite DB, combined with the backup API is perfect for one-shot application (non-server processes). The formatting I do is cleverly composed SQL queries.

srcreigh · on Sept 12, 2022

Are you still talking about logging?

Anyways, would love to hear more about your clever queries, even if it's OT.

thechao · on Sept 12, 2022

Yep. The "style" of logger is in C, and looks (roughly) like this:

Logger* OpenLog(char*);

Table* MakeLog(Logger*, unsigned level, unsigned N, char* colNames, ColType *colTypes);

LOG(Table, Level, ...);

The branch to select log-level is trivial compared to the append cost. There's a few "backends" built in to the API: (1) a text-streaming version that just writes a text log; (2) a "bucketing" version that boils the calls to `MakeLog` down to accumulating buckets (with optional stack tracing); and, (3) journaling to a SQLite DB (with optional stack tracing).

srcreigh · on Sept 12, 2022

Do you use glibc backtraces for stack tracing? I don't use C much, i guess that may be standard. or something else?

Also, can you describe the queries you run against this SQLite? Is it interactive use or do you have any jobs or charts that are powered by it?

So very interested in this whole approach - if you have any posts or links to read more happy to see em!

wsargent · on Sept 12, 2022

Same! https://github.com/tersesystems/blacklite/

GauntletWizard · on Sept 11, 2022

Levels are a great abstraction for debug logs. Debug logs are often abused for much more than their reasonable purpose.

Audit logs are different from debug logs, and it's pretty common for people to not think about the reasons why. Audit logs are records, and they don't have levels; they might have metadata, but you need to capture all of them, reliably. Debug logs are ephemeral and if you miss some of them you shouldn't care.

The problem with structured logging libraries is that they try to do both. There's some in common, but the mechanisms of output should be really different - Debug logs in a circular buffer in RAM makes a lot of sense, but you can't do that for Audit logs. The current approach seems to be "Write them all to the same place, then filter them out", and that only adds more latency in which to lose your audit logs.

marcus_holmes · on Sept 11, 2022

When I last discussed this we ended up in a huge debate about whether we should be sending different log levels to different destinations, because of this dichotomy.

Debug messages shouldn't be stored on permanent logs (they frequently contain sensitive/private info). Alerts need to be sent to someone immediately (even if that's a different process monitoring the log file for alerts and raising them). Errors need context in order to diagnose them (ideally a stack trace). All of these are doing different things using the same interface.

Log levels end up being abused because of this. What we want is to be able to handle different situations differently. What we get is log levels.

jen20 · on Sept 11, 2022

> Audit logs are different from debug logs

Very much so - audit logs, where they are important, are often part of the actual business domain and should be explicitly modelled, not delegated to infrastructure with a slightly different level.

tptacek · on Sept 11, 2022

This appears to be why there's a separation of concerns between "handlers" and "loggers".

GauntletWizard · on Sept 11, 2022

That's exactly what I'm railing against. You shouldn't be pushing all these things into the same stream and then separating them out again. You may want to make it easy to generate a debug log when you generate an alert. You shouldn't do that by writing a log. You should have a very different set of parameters for

If you're a bank, your login "Logs" shouldn't just be an annotated web request; you should have an explicit "AuthenticationSuccess" type in a non-ephemeral data store, and it should contain more than just "301 /auth 1.2.3.4", it should include information about the factors used to make that decision.

Semi-structured logs don't have that, but will be abused to provide that. It should be a type system error to create an AuthSuccess without all that, not a set of default parameters to a generic logging call.

Structured logging is great for moving fast, and I use it all the time to make it easy to filter through my debug logs, but levels are an abstraction in debug logs - and handlers are completely an anti-pattern

tptacek · on Sept 11, 2022

You've lost me, because the point of having a handler interface is exactly to be able to do things like ensuring that your "authentication success" events get stored differently than your debug messages.

You can see upthread that the trend is kind of in exactly the opposite direction to what you're advocating for: plenty of people think there's no such thing as a log or an audit record, only "events", expressed as otel spans. Have fun with those people! :)

didip · on Sept 11, 2022

If Zap and Zerolog can agree on a common interface, I think the rest of Go world would follow.

javier2 · on Sept 12, 2022

I have not used Zap, but I really like the Zerolog API which I have used extensively. I dont have any religious fight with Zap vs Zerolog, I just happened to use Zerolog first.

xani_ · on Sept 11, 2022

Seriously, Zap nails most of what I want from logger and the remaining few % isn't exactly hard to add on top.

Could maybe get a bit of ergonomics pass but that's it

s-xyz · on Sept 11, 2022

Great initiative and well structured problem. Dealt with something similar before, not as detailed as this, but with focus on logging metrics for the purpose of analytics and performance measurement (yes I am aware a db would have been an option too for the analytical part - long story to explain the rationale that I’ll leave out here).

latchkey · on Sept 11, 2022

While we can debate all day long what should and should not be part of the standard lib (bike shed), I just ignore all of this and happily use zerolog.

Problem solved, one dependency, it just works. Back to writing code.

tptacek · on Sept 11, 2022

The proposal acknowledges that. Part of the point of working this out is that it creates a common interface so that your zerolog-logging code plays nice with someone else's hclog-logging code.

oezi · on Sept 11, 2022

Unless the new generic logging API is designed badly so that certain functionalities become hard in the future because they deviate.

For instance when Rails did standardize ActiveJob (background tasks) they didn't include a way in the API to query the current length of a queue of background tasks. Because everyone is now only trying to conform to the API accessing such information is now hard in many of the implementations.

latchkey · on Sept 11, 2022

A long time ago, I was part of JSR-47. I wonder if anything can be learned from that process (which was a bit of a mess).

javier2 · on Sept 12, 2022

I had to look it up, but JSR-47 experience made there sounds very relevant to this issue.

latchkey · on Sept 12, 2022

https://bigthink.com/culture-religion/those-who-do-not-learn...

jen20 · on Sept 11, 2022

The problem is, it's not one dependency. Several important libraries pull in a variety of wrappers and adapters - just to name three:

- glog (gRPC)

- logr (OpenTelemetry and friends)

- klog (Kubernetes client)

It's quite conceivable that even having decided to use zerolog, you end up with at least four separate logging libraries in your dependency tree before having written any of your own code.

latchkey · on Sept 11, 2022

True, but as long as I can either control the logging or write an adapter, I'm ok with that.

At the end of the day, I'd still have that issue, unless the standard lib was being used everywhere... which is something that is equally not controllable.

jen20 · on Sept 11, 2022

The hope of a standard structured logging interface in the standard library is that serious libraries will adapt themselves to use it, and the proliferation will disappear over time. Whether or not this will actually happen, or whether the libraries which do not play ball will have to be forked is a different matter.

latchkey · on Sept 11, 2022

"serious" libraries today all provide the ability to pass in adapters.

Having been in the group of a multi-year effort to do this in Java (JSR-47), there is always a limitation in the API that prevents someone from getting it right.

Funny enough, even though Java ended up with JUL (java.util.logging), SLF4J ended up being the defacto standard that everyone went with.

jen20 · on Sept 11, 2022

> "serious" libraries today all provide the ability to pass in adapters.

Sadly they don't, unless we count things like gRPC, Kubernetes clients and so forth as non-serious.

latchkey · on Sept 12, 2022

Have they indicated they would adopt a SLI?

svnpenn · on Sept 11, 2022

Every time I come across a logging package, I don't understand what the point of them are. Finally in the linked issue, I see this:

> For many programs that aren't run directly by person, like servers, logging is the main way for developers to observe the detailed behavior of the system, and often the first place they go to debug it.

and it makes sense. I don't deal with servers at all, other than a couple I have made just to host some basic sites or programs locally. I think Go team might be overlooking the fact that a big chunk of the userbase doesnt understand, and has no need for a logging package. Couple that with the giant API proposed, along with a poor design decision (kvs ...any), and the large amount of bikeshedding that already happened in this issue, I really hope they think hard before just shoving this into "std".

atsjie · on Sept 11, 2022

The majority of Golang devs actually do work on web services, API's, or libs/frameworks that would benefit from a standardized structured logging interface.

Source: https://go.dev/blog/survey2022-q2-results (scroll almost to the bottom)

For the audience of this proposal it's probably fairly evident what the advantages of standardization would be.

Also you argue quite strongly about the poor API design decisions being made; while you similarly admit not having worked with such tools or with web services in Golang general. That's odd.

svnpenn · on Sept 11, 2022

I don't work with servers much, but I do have extensive experience in Go. I know a bad API when I see one. strings.Replacer [1] uses the same syntax, and several others have commented in the discussion about it being a bad idea.

1. https://godocs.io/strings#Replacer

xani_ · on Sept 11, 2022

From what I've skimmed the most problems seem to be "golang's type system won't allow expressing all of that on compile time so it will be costly" and I don't think that can be sensibly fixed in logging package

> I know a bad API when I see one. strings.Replacer [1] uses the same syntax, and several others have commented in the discussion about it being a bad idea.

What's wrong with it ? Never used it. But I think the complaints has been about wasting time for runtime type handling and Replacer doesn't have that problem.

Joker_vD · on Sept 11, 2022

The main performance problem is with "kvs ...any" parameter in the Logger's functions because it allocates lots of stuff: the array itself, and then the wrappers around the values. I've tried to use something like this in a central loop of a CPU-emulator to trace every instruction executed and IIRC it slowed things about 3x times down. The only thing you can do to escape this cost is

   if logging.Level() >= logging.DEBUG {
       logging.Logger.Debug(...)
   }

but it's ugly and can't be abstracted away. And yes, I've put several of such guards in the several hot paths of some actual production code base and seen about 25% speed up after that edit.

jen20 · on Sept 11, 2022

> What's wrong with it ?

It relies on lining up pairs of values in a way that cannot be checked by the compiler rather than having dedicated functions that ensure they are kept together.

This is the fundamental difference between bad logging libraries (like klog, hclog etc) and good logging libraries (like zerolog).

xani_ · on Sept 11, 2022

I guess simplest fix for that particular one would be accepting `map[string]string` but that won't exactly work for the any type, and you can't easily add more.

Maybe `...map[string]interface{}` but we're back to "the overhead is kinda shit"

xani_ · on Sept 11, 2022

Develop bigger app and you will understand.

But one example - having an logging interface that is not shit (the current builtin is so garbage it might as well not exist at all) allows for packages to adopt it as default method and you magically can just fed same logger to every dep you pull with no loss of features or having to write wrapper.

Contexted logger means you can just fed ["plugin":pluginname] tagged one to every plugin and get logs pointing you to the place where it happened.

The structured logging part is mostly to avoid woes of then having to parse it back to structured format via bunch of regexpes that will inevitably break next release. If you still don't get it, here is a regexp to parse HAProxy logs:

    .*haproxy\[(\d+)]: (.+?):(\d+) \[(.+?)\] (.+?)(|[\~]) (.+?)\/(.+?) ([\-\d]+)\/([\-\d]+)\/([\-\d]+)\/([\-\d]+)\/([\-\d]+) ([\-\d]+) ([\-\d]+) (\S+) (\S+) (\S)(\S)(\S)(\S) ([\-\d]+)\/([\-\d]+)\/([\-\d]+)\/([\-\d]+)\/([\-\d]+) ([\-\d]+)\/([\-\d]+)(| \{.*\}) (".*)([\n|\s]*?)$`)

> I think Go team might be overlooking the fact that a big chunk of the userbase doesnt understand, and has no need for a logging package

How does that proposal impede them ?

if you don't need any of the extras you'd be able to just `log.Info"

kaz-inc · on Sept 11, 2022

I also didn't understand the need for a logging package for a while, but I think it makes sense for go because they're focusing on an audience that writes server applications.

xani_ · on Sept 11, 2022

It's less about package and more about interface.

One "good enough" in stdlib means libraries could just require passing a interface matching it. Then no matter which logging package you use, as long as it fulfils the interface of stdlib logger it can be used with any library or other component you pulled.

So with simple app you can just have simple logger that dumps everything on stderr, complex app can have whatever complex logging logic it needs, but lib "just using stdlib logger" doesn't care about that.

And as a lib author, again, you can just use the stdlib one and most likely cover all your use cases.

tptacek · on Sept 11, 2022

It's not going into std; it's going into /x.

marcus_holmes · on Sept 11, 2022

effectively the same place at this point. There's so many "standard" libs in /x that trying to constraint to "just the standard lib" has to include /x to be actually useful.

tptacek · on Sept 11, 2022

They're not at all effectively the same place, and much of what's in /x competes with third-party packages, which are on a somewhat level playing field in that both /x and third-party code have to be explicitly brought in to a project as a dep.

marcus_holmes · on Sept 11, 2022

meh, I'm always going to prefer an /x package to a 3rd-party package.

philosopher1234 · on Sept 11, 2022

Such as?

marcus_holmes · on Sept 11, 2022

goimports, godoc and the other tools (https://pkg.go.dev/golang.org/x/tools - you basically can't use Go properly without these)

all the crypto stuff (https://pkg.go.dev/golang.org/x/crypto) if you're doing anything with crypto then this is needed

others are really useful (see index at https://pkg.go.dev/golang.org/x) - I've imported the image, sync and text packages

it's a lot of the "missing bits" from the standard lib. I get why they're in /x but the distinction is moot in most cases.

tptacek · on Sept 11, 2022

Almost nobody uses goimports or godoc as a declared dependency; they're tools, not standard library components.

It's perfectly possible to implement cryptography without touching x/crypto; std includes SHA2, HMAC, an AEAD interface, ED25519, and, of course, Go's entire TLS stack. Arguably there's too much cryptography in std (almost nobody is better off directly building on cipher cores and signature algorithms). Thankfully: most Go programs don't implement cryptography, which is dangerous even with Go's relatively solid library interfaces.

marcus_holmes · on Sept 12, 2022

I used the Go crypto primitives in /x to build a perfectly serviceable document encryption mechanism inside a document storage service. It was relatively painless and worked well. I can't really comment on how secure it ended up being - you can always build a crypto product that you can't break.

The standard advice is always "never write your own crypto" and it's very good advice.