I’m increasingly of the opinion that we should stop using “logging” libraries, and instead we should produce telemetry via events using something like OpenTelemetry’s trace API.
These traces can be rendered as strings for people who like their logs in files
Traditional logging has a number of problems. Developers tend to just stick `log.Info` (or whatever equivalent) anywhere.
Using an http server as an example:
The result is that one http request may produce dozens of log messages. But because multiple http requests are running at the same time, they all get interleaved in the actual log.
They can be tied together after the fact if you're using a request ID, but that requires post-processing. It can be difficult to know when a request ends, so there needs to be an explicit "end of request" type message; otherwise you have to assume the request could still be active until the end of the file...unless your log is rotating, in which case it could be in the next file, or the next.
On top of that, it's a big performance hit. On every log message, you're serializing the message, acquiring a mutex, flushing it out to disk. Depending on the library, some of that may or may not get evaluated depending on message severity. It's a non-trivial overhead, and individual messages can cause quite a bit of contention for that mutex.
Using something more like tracing, a log message gets attached to a context for the task, and nothing gets logged until the task is complete. This makes for a single log message that has a list of all the individual informational messages, as well as other metadata about the request. One mutex acquisition, one serialization, one file flush.
It's at that point where tracing and logging are arguably the same thing.
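The buffering described above can be sketched in a few lines of Go. The types and names here are my own for illustration, not any particular library's API: entries accumulate in memory per request, and only the final `Close` serializes and emits anything.

```go
package main

import (
	"encoding/json"
	"fmt"
	"time"
)

// Entry is one informational message collected during a request.
type Entry struct {
	Time time.Time `json:"time"`
	Msg  string    `json:"msg"`
}

// RequestLog buffers entries for a single request; nothing is
// serialized or flushed until Close is called.
type RequestLog struct {
	RequestID string  `json:"request_id"`
	Entries   []Entry `json:"entries"`
}

// Log appends a message to the in-memory buffer -- no I/O, no mutex
// on a shared sink.
func (r *RequestLog) Log(msg string) {
	r.Entries = append(r.Entries, Entry{Time: time.Now(), Msg: msg})
}

// Close serializes the whole request once: one marshal, one write.
func (r *RequestLog) Close() ([]byte, error) {
	return json.Marshal(r)
}

func main() {
	rl := &RequestLog{RequestID: "req-42"}
	rl.Log("auth ok")
	rl.Log("db query done")
	out, _ := rl.Close()
	fmt.Println(string(out))
}
```

A real implementation would also carry this through a `context.Context` and write the result to a shared sink under a single lock, but the shape is the same: many cheap appends, one expensive flush.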
> Using something more like tracing, a log message gets attached to a context for the task, and nothing gets logged until the task is complete. This makes for a single log message that has a list of all the individual informational messages, as well as other metadata about the request. One mutex acquisition, one serialization, one file flush.
That's kind of a terrible design. You don't see anything in progress, only what finishes, and anything in progress just disappears, never to be seen again, if the app crashes.
Splitting into one event per action in a given context is a feature, not a bug.
Not necessarily, this is an implementation detail.
Azure Application Insights for example streams events to a log file asynchronously. A crash is typically captured as a memory snapshot crash dump anyway, so if the entire process terminates then you have something that's probably superior to a text log!
If you open the crash dump, you can inspect the in-memory buffer of log events directly in a debugger.
If the operation fails internally with an exception, then the logs are captured as normal including the exception details.
Asynchronous logging is dramatically more efficient. Or to put it another way, synchronous logging often causes performance bottlenecks that surprise developers.
If you do want "live" logs equivalent to viewing the tail of a text file, most systems (such as App Insights mentioned above) can dynamically switch from async to sync mode and stream all events second-by-second.
Just look into log4j: mdc/ndc to include context (user/session), include the thread name in your pattern and use an async appender to take care of performance. All of these things have been solved already.
MDC/NDC only reliably works when you don't have asynchronous code (Akka/Vert.x/CompletionStage) through your application. As soon as you start using multiple threads, it becomes significantly more complex to keep MDC context carried over from one thread to the next.
I mean...yes, that's how I was able to highlight the benefits of having a single log message per request.
> Just look into log4j: mdc/ndc to include context (user/session), include the thread name in your pattern and use an async appender to take care of performance.
At a glance this looks like it will still produce one log line per logging call, versus buffering with the request. Adding the thread ID just helps stitch them together after the fact. This doesn't get rid of any performance overhead -- it may get it out of the way of a user request, but it's still there under the hood.
In the go case, using the runtime/trace library as an example[0], a `context.Context` is passed down the call stack which gets decorated with trace spans/regions/log messages on the way down. At the end of a task, all of the collected context gets written to the active trace log.
I like my log to include where something happened, and in some complex cases having a few thousand lines of logging info per request is not unreasonable. So stuffing all that info into a single log event/line doesn't make sense to me. I recently had the unfortunate situation where two seemingly independent sessions were interfering; a sequential logfile where things were logged in the order they occurred was the only way to correlate them. Even though we do telemetry via events, old-school logfiles are a godsend for debugging the complex cases.
Logging to syslog already solves the problem of asynchronous / streamed logging.
Structuring log messages better than an arbitrary text string is harder (though I hear the widespread systemd offers a way). Mostly it's because it's hard to agree upon a universal standard. Logging in a self-describing structure is basically the same as logging in JSON (which is not unheard of). Maybe you could do a bit better logging in protobuf or thrift records.
Syslog doesn't solve any of the problems outlined above - you're just trading a file for a unix domain socket. And it introduces its own problems, such as sharing that socket with the containers (if you just mount /dev/log into the container, you're gonna have a bad time when the syslog daemon restarts). Log rotation also becomes a problem at higher volumes. Want a 5 minute rotation? logrotate isn't of much help.
Re: logging in protobuf/thrift. Thrift is better suited to it due to its stream friendliness. With protobufs, you can do better performance-wise than JSON, but it does tend to be a bit more annoying because you need to write a length header between messages. (At this point, I've considered logging directly into a `.tar` file with one file per request. I haven't ever done it, and wouldn't in production, but it's an intriguing thought to me.) With either format, you have to deal with the fact that they're now binary logs, and you lose the ability to use a number of tools like grep & awk, or you need some program to cat the log as plain text.
Logging to json, while it's a bit more overhead, offers access to a lot of tooling which is lost with binary logging or unstructured logging. Athena, snowflake, sqlite, kafka/ksql, jq, it's incredibly ELK friendly...The list goes on. In general, it's usable without the message definitions, so it can be much easier to use.
All of these tradeoffs wildly depend on what it is that you're logging and what your needs are.
> Mostly it's because it's hard to agree upon a universal standard
This is one of the things I’ve appreciated in my corner of the Nodejs ecosystem. It’s fairly standardized around pino and bunyan. With async hooks you can instrument all downstream async APIs too and associate them with the incoming request. So, by default, in my services all logs are associated with their incoming requests.
That being said - I’m inclined to agree that, as the distributed tracing ecosystem becomes increasingly mature, it really is eating the use case of logs. The real trick with tracing is being able to associate a spike in latency for 5xx requests in metrics with all their associated traces. With logs I can aggregate all 5xx logs, filter by latency, and get the subset of logs for the high latency requests to dig in. In more mature stacks, I can even conditionally increase log sampling on-demand for those requests to get a better view into the problem (especially helpful for rare events that are unlikely to make it past the sampler).
I’m not good enough with traces yet to do this. Today I have to start with logs to find the request IDs to get to the traces.
Logging is the lowest of all debugging utilities - it's the first thing you ever do writing software - “hello world”. And, while I admire structured logging, the truth is printing strings remains (truly) the lowest common denominator across software developers.
I often see confusion on how to interpolate variables - let alone zero standardization on field formatting, and poorly managed logging infra, too.
Adding another barrier to entry (protobufs) isn’t going to drive better usage. Any solution I could prescribe would inevitably chop off the long tail of SWE (raise the common denominator), and I think that’s going to be quite an unpopular position to advance in any established company.
To be clear: structure is good, our expectations of how to structurally log are too low, and introducing a build step to compile PB to native, and then dealing with a poorly generated (my opinion) API to log, sounds like a miserable experience.
Yes, or any other data format and/or transport protocol.
I'm surprised this is up for debate.
> Logging is the lowest of all debugging utilities - it's the first thing you ever do writing software - “hello world”. And, while I admire structured logging, the truth is printing strings remains (truly) the lowest common denominator across software developers.
This sort of comment is terribly myopic. You can have a logging API, and then configure your logging to transport the events anywhere, any way. This is a terribly basic feature and requirement, and one that comes out of the box with some systems. Check how SLF4J[1] is pervasive in Java, and how any SLF4J implementation offers logging to stdout or a local file as a very specific and basic usecase.
It turns out that nowadays most developers write software that runs on many computers that aren't stashed over or under their desks, and thus they need efficient and convenient ways to check what's happening either in a node or in all deployments.
What I found was that it's typically not the binary encoding vs string encoding that makes a difference. The biggest factors are "is there a predefined schema", "is there a precompiler that will generate code for this schema", and "what is the complexity of the output format". With that in mind, if you are dealing with chaotic semi-structured data, JSON is pretty good, and actually faster than some binary encodings:
Yes, I have (and currently do) use SLF4J. It’s not a panacea. Maybe you also write software that services millions of MAUs with engineering teams spread across the world - and have seen immensely different results. That’d be a breath of fresh air.
However, as a stickler for quality, I find myself on the wrong end of this discussion in practice. I insist, logging is truly the LCD of software development. Introducing unergonomic data structures generated from protobufs strikes me as something that the long tail would ignore, and instead simply reach for something simpler to instead log `here`, `here1`, etc.
> But because multiple http requests are running at the same time, they all get interleaved in the actual log.
Many logging libraries will export a structured preamble including the timestamp and PID, which is enough to reconstruct single-threaded time histories when you want them, while still allowing you to see cross-thread time-correlated issues when you need that as well.
Unfortunately async code often blows that idea up. Cf. Node apps where the log comes from some arbitrary callback with little to no indication of where the root cause occurred.
> It's at that point where tracing and logging are arguably the same thing.
On some level I agree with this, but honestly tracing doesn't have nearly the same adoption that logging has (at least in my experience).
Why? Because generally it's easier to approach logging, even if in ways that don't necessarily scale that well: like getting your application writing logs to a file on the server directory.
Note: I mention Java here, but a lot of it also applies to Node/Python/Go/Ruby etc.
Eventually they can be moved over to a networked share, have Logrotate running against them, or perhaps have proper log shipping in place. That is perhaps the hardest aspect, since for tracing (outside of something like JavaMelody which is integrated inside of your app) you'll also need to set up some sort of a platform for the logs to be shipped to and processed in any number of ways.
Personally, I've found that something like Apache Skywalking is what many who want to look into it should consider; a setup consists approximately of the following:
- some sort of a database for it (ElasticSearch recommended, apparently PostgreSQL/MariaDB/MySQL viable)
- the server application which will process data (can run as an OCI container)
- the web application if you need an interface of this sort (can run as an OCI container)
- an agent for your language of choice, for example a set of .jar files for Java which can be set up with -javaagent
- optionally, some JS for integrating with your web application (if that's what you're developing)
Technically, you can also use Skywalking for log aggregation, but personally the setup isn't as great and their log view UI is a bit awkward (e.g. it's not easy to preview all logs for a particular service/instance in a file-like view), see the demo: https://skywalking.apache.org/
For logs in particular, Graylog feels reasonably sane, since it has a similarly "manageable" amount of components, for a configuration example see: https://docs.graylog.org/docs/docker#settings
For most of the people who have to deal with self-hosted setups where you might benefit from something like tracing or log shipping, actually getting the platform up and running will be an uphill battle, especially if not everyone sees the value in setting something like this up, or in setting aside enough resources for it. Sometimes people will be more okay with having no idea why a system goes down randomly than with administering something like this constantly and learning new approaches, instead of just rotating a bunch of files.
For others, there are no such worries, because they can open their wallets (without worrying about certain regulations and where their data can be stored, hopefully) and have some cloud provider give them a workable solution, so they just need to integrate their apps with some agent for shipping the information.
For others yet, throwing the requirement over to some other team who's supposed to provide such platform components for them is also a possibility.
> Why? Because generally it's easier to approach logging
That's due to a significant head start. When one practice has more than a decade to mature by spreading across an ecosystem and becoming more ergonomic, it's definitely going to be easier to adopt.
However, tracing has come a LONG way with OpenTelemetry even in the past year. There's still a big mountain to climb, but making tracing better/more widespread and providing on-ramps to metrics and logging systems is exciting.
Logs come in two forms: structured and unstructured. Throw your unstructured logs in the garbage bin if you can - they do very little to help systematically analyze systems.
Structured logs are logs that adhere to a schema and let you attach key:value pairs of additional metadata, so you can add context from your app to those logs. They're extremely helpful for systematically debugging systems, but they're not correlated with anything. With only logs, you need to divine what led to such a log yourself, or manually re-create tracing within your logging system.
Traces are basically structured logs but have correlation IDs that let a backend stitch them together. Tracing libraries also offer a lot more than just this, but it's the core of tracing. I like to think that tracing is just an objectively better form of structured logging.
The otel trace API is nowhere near ergonomic enough to replace logging. I think people mostly tolerate otel because the idiomatic way to use it is just a funcall and a defer at the top of a subset of your functions, and because if you're otel-instrumenting, you're probably already passing a context.
FWIW tracing in Go is primarily made more difficult by having explicit context propagation and no helper to auto-close a span based on scope. If you use Python or .NET for example, it's implicit and you can "just create some spans" like you would a structured log somewhere in your app, then use built-in language constructs to have it handle closing for you.
Sure. If it was a different language, or a different otel library, my opinion would probably be different. I do not think it's realistic to replace logging with otel in Go, though.
I think it's an inclusive OR here. When logging stabilizes in the Go SDK (which will be a while, perhaps a year) the idea will be that you can keep all your existing logging and then start to add tracing judiciously so they're easily correlated with requests. I think the juice is worth that particular squeeze for more folks at least.
From there I would personally advocate for adding more manual traces rather than more logs, but it's understandable if it's a heavier lift to pass context around all the time. The more options the better.
For web-style application code my experience writing Go code says that you totally can stop using logs and just use traces and have just as much visibility into a running process, if not more.
For library code or other deployment models I’m not sure - it’s not something I’ve had a need to experiment with
.NET has task-local storage which is pretty amazing, and Serilog made great use of it (my all-time favorite structured logging library).
But IIRC task-local storage is a perf concern, and JavaScript has yet to have anything comparable that's ubiquitous (Node.js async hooks exist, though, so one day maybe).
So, explicit contexts for Go and JavaScript unfortunately.
JS does have convenience APIs in the form of startActiveSpan and getActiveSpan that take care of most of the pains here. There are likely some scenarios where you can't use these (e.g., instrumenting a tool that must use the sdk-trace-base package) but I think in most cases they help a lot. Go doesn't have these kinds of convenience APIs though.
What annoys me about otel is that it seems to require direct live connections. I'd much rather just add the tracking keys / spans / whatever to my logs and parse them out of there later, but no standard lib / app / process seems to like that idea. I have no intention of adding extra sockets to my requests when I can just dump everything to the logs as JSON.
So that's what I do, but instead of relying on otel tools, I have custom queries I run against elasticsearch (or grep syslog) using my trace tags.
Works great. No multi-process visualizations but i can make my way through a process chain quickly enough
I've played with trying to create an idiomatic wrapper for their Go module for precisely this reason, but then didn't find time to continue with it.
> I’m increasingly of the opinion that we should stop using “logging” libraries, and instead we should produce telemetry via events using something like OpenTelemetry’s trace API.
That really depends on what you mean by "events". Even OpenTelemetry supports logging events. Logging is not replaced by emitting metrics, as they have very distinct responsibilities.
> These traces can be rendered as strings for people who like their logs in files
You're missing the whole point of logging stuff. The point of logging things is not to have stuff show up in files. Logging events are the epitome of observability, and metrics and tracing events end up as specializations of logging events which only cover very precise and specific usecases.
> Logging events are the epitome of observability, and metrics and tracing events end up as specializations of logging events which only cover very precise and specific usecases.
This is a common way of thinking about things, but having now worked with event-based tracing for the last few years I no longer believe traditional logs are necessary.
If you do “good” logging you likely have entirely structured data and no dynamic portions of the main log message. This is almost exactly equivalent to a span as represented by tracing implementations.
I have come to believe that these events are the core piece of telemetry, and from them you can derive metrics, spans, traces and even normal-looking logs.
I see it the other way around: tracing is essentially logging with hierarchy. If you can keep the context of a "parent" span around, then you can log all of the entries out and build up the entire trace from spans, although you can get pretty confused if you don't write them out in the correct order.
However, if you don't have context, then the log entry is the atomic unit -- metrics are log entries with numbers that can be aggregated, spans are log entries with a trace id and a parent span, and "events" in Honeycomb terminology are logs with attributes covering an entire business operation / HTTP request.
> This is a common way of thinking about things, but having now worked with event-based tracing for the last few years I no longer believe traditional logs are necessary.
You are free to use stuff as you see fit, even if you intentionally miss out on features.
Logging is a very basic feature whose usecases are not covered by metrics events, or even traces. Logs are used to help developers troubleshoot problems by giving them an open format to output data that's relevant to monitor specific aspects of a system's behavior.
It makes no sense to emit a metrics event with tons of stuff as dimensions and/or annotations from a single line of code executed by a single code path, when all you need is a single log message.
Also, stack traces are emitted as logging events, and metrics events are pretty useless at tracking that sort of info.
But hey, to each their own. If you feel that emitting a counter or a timer gives you enough info to check why your code caught an exception and what made it crash, more power to you. The whole world seems to disagree, given standard logging frameworks even added support for logging stack traces as structured logs as a basic feature.
A counter or a timer is clearly no substitute for a log, that would be an absurd statement for me to have made - which is why I said no such thing.
A span as modelled by OpenTelemetry is a bag of developer-chosen structured data containing a name, a timestamp and some auto-generated IDs to correlate it with others.
It is effectively an open standard for how to collect good structured logs - although it is currently not commonly pitched that way by either tracing advocates or structured logging advocates.
These are commonly used to create trace waterfalls, but can also be used to derive metrics or simple dev-debug output.
I have used this approach myself to great effect, and have spoken to a number of people who have also found success with it - but so far the momentum of existing approaches (and the fact that traditional logging still works well enough for most people) has not led to wide adoption.
Many tracing libraries get their data off logging libraries. Logging doesn't always mean "whoa, let's stop everything and write to a file", it is often a matter of buffering data or increasing counters until data can be sent out in various fashions.
I guess Golang is catching up but in something like C# or Java, the logging libraries already feel like tracing libraries. They already have contexts and structure.
What are you proposing should change? Log files should default to a verbose binary representation that captures more? What do you think is not captured by today's logging?
Tracing gives you correlation between "log lines" and across process/network boundaries by propagating context for you. You could replicate everything that tracing does by manually propagating around context-like metadata yourself, stitching the log lines together into a "trace". But why not just use an SDK that does that for you?
In the C# world, tracing is also just a config away. It's baked into the runtime. But you don't need to manually stitch together logs into a "trace" - it just does it for you.
What I meant was you can easily add a correlation id to your logs, and send those to wherever you collect and query the logs. Then using that you can easily see the log across your services. So there isn’t any manual stitching in the logging either.
You can! And then if you want to establish causality from log to log, representing which part of a system calls which other part, you can add another id that lets you know which log is a “parent” of another. And maybe later, you may need to propagate metadata across a request, so you build that system and use it to better enrich your logs. And maybe later, you’ll need to correlate subsets of logs with other subsets of logs, so you build a way to “link” related sets of logs together. And so on, and so on. All of this is possible with “just logs” today and if it’s your jam, go for it. But these are reasons why tracing libraries exist.
I'm of the opinion that we should use both. They're both fundamentally useful and solve separate problems, and I don't think there's any resource constraints that would make this an issue.
A logging API that just lets me write either a textual string, a structured telemetry object, or just both would be ideal. Then I can have either source separately, or I can view how they generated events collectively.
OpenTelemetry's design says: why not both! Logging is still early days, but the idea is that you can take your existing structured logs and have them automatically correlated with traces that you can add later.
(there's other use cases some can envision, but I'm team "tracing is better than logging", while recognizing that most apps don't start from scratch and need to bring their logs along for the ride)
I'm in the position of working on a green-field system currently, and spent a bit of time looking at OpenTelemetry both as a logging system and for metrics and tracing. If OpenTelemetry can indeed suit that use case today, it's not evident from any of the documentation or examples, and the naive "go get" approach in a test project led to incompatible versions of unstable libraries being pulled in.
Further, the OpenTelemetry Go library pulls in all kinds of dependencies which are not acceptable: a YAML parser which doesn't even use compatible tagging and versioning, Testify, logr and friends, and go-cmp. Combined with the insistence of gRPC on pulling in glog and a bunch of other random stuff like appengine libraries, you end up with a huge dependency tree before you've even started writing code.
The result is I'm sticking with zerolog, and a custom metrics library, and will let Splunk do its thing. It's a shame, as distributed tracing is very desirable, but not desirable enough to ignore libraries that have chosen concrete dependencies over interfaces and adapters.
Fundamental things like this need to be part of the standard library, or have zero external dependencies. Rust does (in my opinion) a better job here despite requiring on average more libraries to achieve things, because features can be switched off via Cargo.toml, reducing the overall dependency set.
(note: I'm not a maintainer in the Go SIG so I can't speak towards these issues with much authority)
Definitely interested in how things can be improved here. What are you generally looking to accomplish? Hopefully it's just a matter of docs not accurately expressing what's possible today (a consistent theme for quite some time that's getting resolved, but it takes time).
As for the dependency issues, is this something you'd have the time to file an issue about[0], or make an agenda item for the next Go SIG? At the end of each SIG there's also a moment where the SIG leader asks for stories about adoption (good or bad) and this would clearly count. I'd like to think that these issues are solvable, or at least solvable enough to make it worth it for you to adopt.
I think the aim is to have a single "sink" for metrics, logs and tracing, and it's clear that OpenTelemetry is aiming to be that. Ideally from a single "go get", one should be able to pull in a library without additional dependencies that:
- permits recording of structured logs (perhaps the stdlib interface under discussion will help here!). From a personal perspective I'd rather not have an API that accepts key-value pairs as varargs, but if that was the only thing available I'd take it I think.
- permits recording of counters, gauges, timers and histograms,
- permits recording of spans, and associating both logs and metrics with those spans.
The Go library README shows the logs section as "frozen" and states that no pull requests to improve logging are being accepted. With no visibility into the process (which is, to be clear, because I haven't looked into it and not the fault of the project per se!), it looks like this might be an option in future but not today?
I see! Yeah, this is one area where otel-go is a lot harder to use, but it's something the SIG is looking at. A coworker of mine is helping drive a design that's sort of an "easy button" to configure all the things with the least-surprising defaults[0] and we're seeing how people like it in our SDK distribution that uses it[1]. I hope that sometime soon we'll have the design polished up enough to get merged in. Like most OSS projects, it'll take some time but I'm confident we can get it done.
The main challenge is that there's a large variety of use cases to fulfill (e.g., someone wants custom context propagation, a custom span processor, and export over HTTP+json but not HTTP+protobuf) and today the answer to that is that you have to pull in all the libraries for all the things you need. It's a lot more energy you need to expend to get started with all of this than it needs to be.
As for logging support in the Go SDK, it's frozen mostly just due to lack of bandwidth and a need to finish what's already been started. Metrics have proven to be much more difficult and time-consuming to implement correctly across all languages, with Go being impacted harder than other languages (e.g., Python and .NET). I think you can expect logging integrations in the near-ish future though.
This is great feedback. I'll pass it on to folks who haven't seen it. Thank you! And please feel free to file issues about all the things that rub you the wrong way.
.NET, Python, and Java have OTLP exporters for both today. Logs are still mostly alpha/experimental today, but stable in the data specification, so you can start using them together today.
I'm not necessarily a fan of OpenTelemetry, as every time I looked at it it was still "in progress" (though the last time was almost 2 years ago), but in general I agree.
We started with tracing+logging, but as time passed we moved more and more stuff into tracing. But not just span-per-service kind of tracing - detailed orchestration on the level of "interesting" functions, with important info added as tags.
Right now we use almost exclusively tracing (and metrics ofc) and it's working great, even if each trace often has hundreds of spans.
In our case we use Datadog, which works well for most bits and purposes, but we also send all traces over Firehose to S3 so that we can query them with Athena at 100% retention. This way you get to have your cake and eat it too.
If you've got the time for instrumenting something relatively small, it's definitely worth picking up OTel again to see how it works for you. Tracing is stable in most languages (Go included) and there's a lot more support in each language's ecosystem.
FWIW I think OTel will always be in a state of "in progress" though. There's more scenarios and signal types that will get picked up (profiling, RUM) and more things to spread across the ecosystem (k8s operator improvements, collector processors for all kinds of things, etc.) and more integration into relevant tools and frameworks (.NET does this today, maybe flask or spring could do it tomorrow). But I wouldn't let that stop you from trying it out again to see if you can get some value out of it!
Log the things that require a human to fix something every time they happen (Include useful info that’s needed for fixing). Create metrics of the rest: things that show patterns. Trace a percentile of actions that span over multiple components.
I believe all the three: logging, metrics, tracing have their own uses, one can’t replace the other.
Would love to learn which areas of OTel you find are onerous to set up and maintain. Experiences can differ from language to language which is why I ask. For example, with JS/Node there's a convenient SDK package and autoinstrumentation metapackage, but with Go you need to install a lot more stuff to get an equivalent experience.
We mainly use Go and Java, with different DBs and way too many HTTP clients. But the instrumentation is just a small part of it; we actually had most services fully instrumented with OpenTracing, but then they decided to abandon OpenTracing for OTel... And even then, the more difficult thing is running the trace ingestion without it crashing or running out of resources all the time, and balancing that with just having everything disabled. Our most stable setup was just running with the in-memory store so we at least have a trace for a few hours when we need it, but we essentially just gave up.