It uses a background thread to do most of the work, and it appears the 7ns latency numbers are a little cooked:
1. The paper's 7ns-ish number is 8ns in microbenchmarks but 18ns in applications. The 7ns figure is, I'm guessing, from microbenchmarks, and the true application-level number is probably more in the 17ns range.
2. It isn't precisely clear what that is measuring. The paper says it is the invocation time of the logging thread. Considering the thread making the call to log just passes most of the work to a background thread through a multi-producer single-consumer queue of some sort, this is likely the time to dump the message into the queue.
So you really aren't logging in 7ns. The way I'm reading this is you're dumping onto a queue in ~17ns and letting a background thread do the actual work.
The workload is cut down by a preprocessing step that creates a dictionary of static elements to reduce the I/O cost of the thread doing the actual writing (I assume this just means taking the format strings and indexing them, which you could build at runtime, so I'm not sure the pre-processing step is really needed).
My logger that dumps binary blobs onto a ring buffer for another process to write might be able to beat this invocation latency. This isn't really groundbreaking. I know a few places that log binary blobs and format them later. None of them do the dictionary part, but when that work is going to a background thread, I'm not sure how much it matters.
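On the "you could build the dictionary at runtime" point, here's a minimal sketch of what I mean (hypothetical, not how NanoLog actually does it; the class and method names are made up): intern each format string once into a table and only push a small integer ID plus the raw arguments through the queue.

    // Hypothetical sketch, not NanoLog's actual mechanism: intern format
    // strings at runtime so the hot path only enqueues an integer ID plus
    // the raw arguments; the background writer expands IDs back to text.
    #include <cstdint>
    #include <mutex>
    #include <string>
    #include <unordered_map>
    #include <vector>

    class FormatDictionary {
    public:
        // Called once per call site (e.g. cached in a function-local static),
        // so the lock stays off the steady-state hot path.
        uint32_t intern(const char* fmt) {
            std::lock_guard<std::mutex> lock(mutex_);
            auto it = ids_.find(fmt);
            if (it != ids_.end()) return it->second;
            uint32_t id = static_cast<uint32_t>(strings_.size());
            strings_.push_back(fmt);
            ids_.emplace(fmt, id);
            return id;
        }

        // Used by the background thread when formatting the binary log.
        const std::string& lookup(uint32_t id) const { return strings_[id]; }

    private:
        std::mutex mutex_;
        std::unordered_map<std::string, uint32_t> ids_;
        std::vector<std::string> strings_;
    };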
Yes, the overhead in the logging thread is what this is trying to minimize. The background thread is considered "free". This sort of async logging is a common setup for some class of applications.
And yes, it boils down to writing data to a message queue. Most of the overhead is probably the call to the hardware timestamp counter.
In my own logging code, which is basically an SPSC ring buffer, I use some RDTSC assembly, and at startup I calculate the frequency and epoch offset. It has a throughput of around 30 cycles per message.
That's already ~10ns, so I'm not sure how they are getting their numbers. If they are timestamping the data when the background thread gets to it, that would be pushing even more work to it. I'm guessing they do, or else they could potentially be logging out-of-order data with multiple threads.
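For anyone curious what that startup calibration looks like, here's a rough sketch (my actual code uses inline assembly; this one uses the __rdtsc intrinsic, and the struct name and 100ms wait are just illustrative):

    // Rough sketch of startup TSC calibration: sample the TSC against
    // CLOCK_REALTIME, derive ticks-per-ns and an epoch offset, then convert
    // raw RDTSC readings to wall-clock nanoseconds with no further syscalls.
    #include <chrono>
    #include <cstdint>
    #include <thread>
    #include <time.h>
    #include <x86intrin.h>   // __rdtsc

    struct TscClock {
        double   ns_per_tick;
        uint64_t tsc0;       // TSC reading taken at calibration
        int64_t  epoch_ns0;  // CLOCK_REALTIME taken at calibration

        static int64_t realtime_ns() {
            timespec ts;
            clock_gettime(CLOCK_REALTIME, &ts);
            return int64_t(ts.tv_sec) * 1000000000LL + ts.tv_nsec;
        }

        static TscClock calibrate() {
            uint64_t t0 = __rdtsc();
            int64_t  n0 = realtime_ns();
            std::this_thread::sleep_for(std::chrono::milliseconds(100));
            uint64_t t1 = __rdtsc();
            int64_t  n1 = realtime_ns();
            return { double(n1 - n0) / double(t1 - t0), t0, n0 };
        }

        int64_t to_epoch_ns(uint64_t tsc) const {
            return epoch_ns0 + int64_t(double(tsc - tsc0) * ns_per_tick);
        }
    };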
In a similar SPSC ring buffer setup I see something like 1.8ns per 16B entry without RDTSC (our design has a cached thread-local tsc that is updated periodically, so this is vaguely reasonable) or 7.6ns per entry with RDTSC (approximate RDTSC cost of 4.8ns). This is on a VM on some sort of Skylake CPU (maybe Skylake X).
You could store the current date and time in a global variable and have the producers just read it atomically.
The consumer thread would then update it periodically.
Timestamps will be somewhat inaccurate but it may help performance.
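A minimal sketch of that idea, assuming a single background thread refreshes the global (the names are illustrative):

    // "Coarse shared clock": the background thread refreshes a global
    // timestamp periodically; producers do one relaxed atomic load, so the
    // hot path never touches the clock at all.
    #include <atomic>
    #include <chrono>
    #include <cstdint>

    std::atomic<int64_t> g_coarse_now_ns{0};

    // Producer side: a single relaxed load, a handful of cycles.
    inline int64_t coarse_now_ns() {
        return g_coarse_now_ns.load(std::memory_order_relaxed);
    }

    // Consumer/background side: call this between drain passes.
    inline void refresh_coarse_clock() {
        auto now = std::chrono::system_clock::now().time_since_epoch();
        g_coarse_now_ns.store(
            std::chrono::duration_cast<std::chrono::nanoseconds>(now).count(),
            std::memory_order_relaxed);
    }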
That's what the vDSO version of clock_gettime does. If you use one of the *_COARSE clocks it will only update periodically and be much faster, but that means something like 15 milliseconds' worth of log messages will all have the same timestamp.
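For reference, the coarse clocks are just a different clock ID passed to the same call (Linux-specific):

    // Same clock_gettime call, different clock ID: CLOCK_REALTIME_COARSE only
    // advances at the kernel tick rate (a few ms) but stays cheap in the vDSO.
    #include <cstdio>
    #include <time.h>

    int main() {
        timespec fine, coarse;
        clock_gettime(CLOCK_REALTIME, &fine);          // full resolution, slower
        clock_gettime(CLOCK_REALTIME_COARSE, &coarse); // tick resolution, faster
        std::printf("fine:   %ld.%09ld\ncoarse: %ld.%09ld\n",
                    (long)fine.tv_sec, (long)fine.tv_nsec,
                    (long)coarse.tv_sec, (long)coarse.tv_nsec);
        return 0;
    }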
The fastest option for nanosecond precision (bonus: it's even sub-nanosecond) is just to store the return value of RDTSC and let the background thread figure it all out. You don't even need to precalculate the frequency or epoch offset. Just write a couple of log messages containing the RDTSC value and CLOCK_REALTIME and let the post-processing figure it out.
To cut down on I/O, each log message's timestamp can just be an offset from the last event.
If you are willing to push a lot of work to the background thread and even more to the post-processing step, you really don't need to do very much.
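Roughly like this, as a sketch (the record layout and names are made up):

    // "Store raw RDTSC, sort it out later": the hot path records only the TSC
    // delta since the previous message, and an occasional calibration record
    // pairs a raw TSC value with CLOCK_REALTIME so post-processing can
    // recover absolute times.
    #include <cstdint>
    #include <time.h>
    #include <x86intrin.h>

    struct CalibrationRecord { uint64_t tsc; int64_t realtime_ns; };

    inline CalibrationRecord make_calibration_record() {
        timespec ts;
        clock_gettime(CLOCK_REALTIME, &ts);
        return { __rdtsc(), int64_t(ts.tv_sec) * 1000000000LL + ts.tv_nsec };
    }

    // Per message: just the TSC delta since the previous message on this
    // thread. Small deltas also take fewer bytes if you varint-encode them.
    inline uint64_t tsc_delta(uint64_t& last_tsc) {
        uint64_t now = __rdtsc();
        uint64_t delta = now - last_tsc;
        last_tsc = now;
        return delta;
    }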
> That's what the vDSO version of clock_gettime does. If you use one of the *_COARSE clocks it will only update periodically and be much faster, but that means something like 15 milliseconds' worth of log messages will all have the same timestamp.
Not sure it matters a lot to have multiple messages with the same timestamp; since they were added in order you still know which one is older. The problem might arise when you send those logs to a remote place where the insertion order is discarded and the timestamp is used instead.
I assume that when you use a single thread with a queue / ring buffer the order of insertion is kept.
Probably in this case it's important to use some kind of synthetic timestamping to preserve the ordering (for example, within a 5.123 ms timestamp there are ~1,000,000 nanosecond slots available; say there are a thousand entries that need to be ordered, one can represent them as 5.123000[n]..., where the extra digits are just a silly in-band way to give an ordering hint to whoever later examines the logs).
Since you aren't going to be writing a message per nanosecond, you can always just do `last_nanos = max(last_nanos + 1, cur_nanos)` and then use last_nanos for the timestamp. You can even do it in RDTSC ticks and get roughly 1/3-nanosecond values. Obviously the clock isn't nearly that accurate, but it lets you use those fractional nanos to ensure a strictly increasing ordering.
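In code that's just:

    // Strictly increasing timestamps: never emit a value <= the previous one,
    // so ordering survives coarse or repeated clock readings.
    #include <cstdint>

    inline int64_t next_timestamp(int64_t& last_ns, int64_t cur_ns) {
        last_ns = (cur_ns > last_ns) ? cur_ns : last_ns + 1;
        return last_ns;
    }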
It was at first, until it became obvious that it was useless like that.
You had the OS down-clocking cores to avoid CPUs cooking themselves, causing time to increase at inconsistent rates. Your app queried the clock rate, but unless it kept querying it, the number of cycles was meaningless.
ie:
(CPU rate changed to 1GHz)
Query Rate, 1GHz.
(CPU rate changed to 700MHz)
Read start time
Read end time
convert delta into time, using 1GHz rate... not accurate.
or
(CPU rate set to 1GHz)
Read start time
Read end time
(CPU rate set to 700MHz)
Query rate, 700MHz.
convert delta into time, using 700MHz rate... not accurate.
Next you had multicore, and the OS would down-clock or turn off some idle cores, again, perhaps they were throttled to avoid overheating. Again, the clock rate would change dynamically, but per core. Not only did time not increase at a constant pace, but if a thread jumped between cores, time could appear to go backwards!
It was a mess. Windows introduced a call to get time in a safe way, as the OS knew when it changed CPU frequencies. Intel fixed the hardware to use a constant clock after just a few models, but the damage was done... It's invariant... usually, but not for certain.
I did something similar about a decade ago. The main active costs for logging (assuming writing into a fixed-size lockless ring buffer) are the string formatting and the timestamp fetching. I actually did my own timestamp interpolation with RDTSC and periodic clock fetches to get this to a minimum. String formatting you can avoid completely by preregistering messages and storing away the arguments.
I have also seen people make the logger into a DSO so they can dump strings from the main app binary and record addresses, to avoid the hassle of preregistering. But with preregistering I was able to get the perf all the way down to ~40ns per message without a separate log-dumper thread. At that point you actually start running into disk I/O limitations.
One additional thing I did was to create the ring buffer in a shared memory segment and log to it from all services in the system, so you don't even have to think about crashing services. Anyway, the point is this hardly seems groundbreaking. What I'd like to see is this evolve into a fast syslog-type service where one just freely logs trace messages throughout without worrying about the cost, and we can use them for postmortem analysis or filter at dump time.
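"Preregister the message, store away the arguments" boils down to something like this sketch (the layout and names are made up, and it only handles trivially copyable arguments):

    // Hot path: write a small header plus the raw argument bytes; the format
    // string is looked up by ID only when the log is dumped and formatted.
    #include <cstddef>
    #include <cstdint>
    #include <cstring>
    #include <type_traits>

    struct LogHeader {
        uint32_t msg_id;      // index into the preregistered format-string table
        uint32_t payload_len; // bytes of raw arguments that follow
        uint64_t tsc;         // raw timestamp counter value
    };

    // `buf` points into a ring-buffer slot already reserved by the caller.
    template <typename... Args>
    std::size_t encode(char* buf, uint32_t msg_id, uint64_t tsc, const Args&... args) {
        static_assert((std::is_trivially_copyable_v<Args> && ...),
                      "arguments are copied verbatim; formatting is deferred");
        LogHeader hdr{msg_id, 0, tsc};
        char* p = buf + sizeof(hdr);
        // Copy each argument verbatim, one memcpy per item.
        ((std::memcpy(p, &args, sizeof(args)), p += sizeof(args)), ...);
        hdr.payload_len = static_cast<uint32_t>(p - (buf + sizeof(hdr)));
        std::memcpy(buf, &hdr, sizeof(hdr));
        return static_cast<std::size_t>(p - buf);
    }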
The only thing that makes sense is that the thread sending the logs is blocked for 7ns; otherwise too much context-dependent overhead comes in to make a claim like this.
Yeah. But that’s okay. For the applications that care about this, it’s just a question of how long the application thread is blocked. The logging thread can spin on its core and take as long as it wants so long as it can burn down the buffer faster than the application thread can add to it.
I thought log4j2 used a disruptor in between app thread and logging thread but maybe I’m wrong.
One of the reasons I think logging as a production tool has risen in status in the last fifteen years is that in the 90's you had languages where it was decided that if your app crashes, we want to know what the last operation in flight was when it crashed.
As we now know the syscalls to force flush to disk often didn’t, so if you were trying to figure out what happened right before a kernel panic, good fucking luck. But if it was a purely userspace problem, you might be okay.
These days we care more about throughput. I’d personally rather see us commit to stats for use cases that require that sort of logging traffic, but I think that’s one of those conversations that’s nearing the top of the heap. Maybe we’ll have that argument as soon as we stop flooding the literature with talk of AI.
So 7ns tells me something about throughput, but I think I'd rather just know the messages per second per CPU. And perhaps more importantly, with a CPU that's 80% saturated doing non-logging tasks. Because I need to make forward progress between the log messages.
I've used https://github.com/rokath/trice which is similar but targeting microcontrollers. It isn't immediately clear to me if nanolog could run on a microcontroller with its output directed over RTT/uart/etc or not.
The consumer side of this would be polling a memory location for new logs, correct? It would not be possible to wake up the consumer in 7ns, as that would take a FUTEX_WAKE system call, which is O(microseconds).
I've been wondering about a FUTEX_WAKE that does not require a system call. Possibly, the kernel could poll a global memory area. Or maybe there is some low-level debugging API available where the kernel could be notified of a memory write by a process?
The background thread can just sleep if no data in the queue.
Since you are preprocessing the log messages, you know the size bounds. Say the thread sleeps for 1ms. Be generous and say you can push to the queue at most once every 5ns, and the largest message you push is 200 bytes; then you can statically determine that a 40MB ring buffer will have enough space for the consumer to sleep even at the max rate with the largest messages (see the sketch below).
And that's just a simplistic scheme. If you have a pool of logging structs, so your queue entry is just a pointer to one of those, you can get away with even less.
And I can think of faster ways too. It isn't that difficult to get around using a semaphore.
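The sizing arithmetic from above, written out (the numbers are the illustrative ones from this comment, not measurements):

    #include <cstddef>

    constexpr std::size_t kSleepNs        = 1000000; // consumer sleeps for 1 ms
    constexpr std::size_t kMinPushNs      = 5;       // fastest possible producer push
    constexpr std::size_t kMaxMessageSize = 200;     // largest encoded message, bytes

    // Worst case: messages pushed back-to-back for the whole sleep interval.
    constexpr std::size_t kRingBytes =
        (kSleepNs / kMinPushNs) * kMaxMessageSize;    // 200,000 * 200 bytes

    static_assert(kRingBytes == 40000000, "the 40MB figure above");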
> There isn't a significant advantage in having the kernel doing the polling, it would still be busy polling.
I was thinking in terms of a generic syscall-less wake functionality where the kernel could do this for all processes in the system. So you'd lose one core per system instead of one core per consumer.
Interesting. Could be used to make the kernel loop above burn less power.
A user-space implementation could presumably also be built. There could be a shared memory segment shared between producers and a monitor. A producer sets a flag in case it needs attention, and the monitor busy polls the segment. The monitor could then use e.g. a signal to wake up consumers.
The latency between the producer signaling and the consumer taking action would be higher than with futexes. But there would be no waits/context switches in the producer at all. Might be a solution for some low-latency use cases.
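Something like this sketch, roughly (the shm_open/mmap plumbing to put the flags in a shared segment is omitted, and the flag layout, signal choice, and names are all just illustrative):

    // Producers raise a flag with a plain atomic store (no syscall); a monitor
    // busy-polls the flags and pokes the consumer with a signal out-of-band.
    #include <atomic>
    #include <cstdint>
    #include <signal.h>
    #include <sys/types.h>

    constexpr int kMaxProducers = 64;

    struct AttentionFlags {
        std::atomic<uint32_t> flags[kMaxProducers];  // placed in shared memory
    };

    // Producer side: one release store, no wait, no context switch.
    inline void request_attention(AttentionFlags* a, int producer_id) {
        a->flags[producer_id].store(1, std::memory_order_release);
    }

    // Monitor side: burn one core scanning; wake the consumer out-of-band.
    void monitor_loop(AttentionFlags* a, pid_t consumer_pid) {
        for (;;) {
            for (int i = 0; i < kMaxProducers; ++i) {
                if (a->flags[i].exchange(0, std::memory_order_acquire)) {
                    kill(consumer_pid, SIGUSR1);  // consumer sigwaits on SIGUSR1
                }
            }
        }
    }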
The consumer (background thread) only polls; there is no need to wake up the consumer.
At steady state, the consumer is either blocked on I/O or is scanning the producer buffers. When all producer buffers are empty, it can just sleep for a short time.
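I.e., a loop along these lines (a minimal sketch; the drain callback stands in for whatever scans the producer buffers):

    #include <chrono>
    #include <functional>
    #include <thread>

    // `drain_all` scans every producer buffer, writes out what it finds,
    // and returns true if any data was found.
    void consumer_loop(const std::function<bool()>& drain_all) {
        for (;;) {
            if (!drain_all()) {
                // Everything was empty: back off briefly instead of busy spinning.
                std::this_thread::sleep_for(std::chrono::microseconds(100));
            }
        }
    }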
This would work in this specific case, where we know there is a maximum rate at which work is produced. Arguably I was hijacking the thread to discuss a more general problem that I've been thinking about for a while. I have the sense that a ring buffer with a wait-free push, a low bound on consumer-side latency, and no need for fixed-interval polling would be a nice primitive that I certainly could have used at times.
And in fact... the wait-free wakeup part of this is already there. Now that io_uring has futex support, a producer can enable kernel-side busy polling on the uring, and then submit a FUTEX_WAKE to the ring without doing any syscalls. This GitHub issue [1] has a nice description.
Interesting, thanks for sharing. I did some digging on the linux-kernel mailing list and it seems that after the first patch series in Dec ‘21 and the LPC talk there was no further work on this. Too bad if true. Also I wasn’t able to easily find which hardware supports this today. I did see a reference to a RISC-V project that is adding a similar feature.
That seems like a type error on your part. O(unit) is a categorically different notation than O(mathematical function). You must be confusing it with the asymptotic notation that is the latter. But units are not functions, so clearly it cannot be the same notation.
One might say it's also pretty petty to call out a casual usage of a notation being used in a way that people in the thread are understanding just fine...
7ns latency is in the ballpark of small writes to L1 cache, i.e. some sort of in-memory-only append, to a location predicted by the prefetcher, of a single cache line or less. So yeah, some sort of ring-buffer log could definitely support this kind of figure. The question is how much throughput your persistence channel has, how much memory you can devote to logging while your persistence runs async, whether you are OK with a somewhat volatile log, etc.
You don't really wait on L1 cache writes, though. The store buffer absorbs them, and if the data is needed it can be forwarded from there before the write to the cache even happens.
Most x64 L1d caches have a 4-6 cycle latency depending on the CPU; that's 1 to 2 ns depending on frequency.
Doesn't a logging system need a storage system that can keep up with it, if the goal is to persist logs for later analysis? What storage could keep up with this?
I think the idea here is to separate the log call site in application code from the log processing/persistence?
So, the nanosecond values quoted are the impact exposed to your application code, but some other process takes over the bulk of the work of the logging.
So as long as the offloaded process can keep up with the average log rate, it'll be fine - but also as a bonus the application does not see increased latency due to logging.
I have ideas for a logging/wide metric system that uses this technique and some others stolen from DNS and IP. It's largely inspired by a combination of a system I've built at my day job that implements distributed command & control for servo-ing, monitoring, etc. It's been really successful, but the hardest part is mapping a unique numeric identifier to a human readable string in a way that is dynamic and efficient enough. It really seems like the exact same problem as DNS, which leads me to believe there's likely no way without a persistent centralized registry/database.
The paper says a lot of the secret sauce is dumping a dictionary of the static content and then logging in a binary format. That format looks a lot like gzip, if you squint. Could something like this use the _actual_ gzip format, but writing with a static dictionary, to make life easier for tools? (gzip has a trailer, but I'm not sure how much attention is paid to that, since it's often used for streams)
So you're actually spawning threads in this simple C++ example? I thought this was something people refrained from in C++ land...
#include "NanoLogCpp17.h"
using namespace NanoLog::LogLevels;
int main()
{
    NANO_LOG(NOTICE, "Hello World! This is an integer %d and a double %lf\r\n", 1, 2.0);
    return 0;
}
Amazing work! I was wondering just a few months ago whether someone ever made a logger that deferred all the expensive work of string formatting to consumption time.
~~I'm a bit surprised that it didn't come along sooner though. How come nobody at Google or VMware who said they noticed this problem solved it? Or any other major tech company? I guess maybe this is partially an issue with our programming languages and build tools. I'm a Zig enthusiast, so in my head it's trivial, but I guess it won't be until C++26 that C++ gets potentially comparable comptime facilities.~~ I'm surprised Go doesn't work like this by default though. For a language like Go, I'd have made a builtin log keyword that does this.
EDIT: Looks like other implementations of similar ideas do exist. Still awesome though!
You have to be careful in deferring such work. It may end up more expensive if it means you have multiple threads accessing that data, and/or needing to extend the lifetime of an object so the logger can access it.
As long as you are just using static strings and native types, it amounts to a pointer/index bump and a load/store per item. Let's imagine you have the format string, priority number, system ID, and 7 pieces of data in the payload. That's 10 items, so something like 40 cycles, or roughly 13ns at 3GHz; I can see the 18ns the paper gets.
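Concretely, that's roughly this (the field names are made up; `slot` is space already reserved in the ring buffer):

    // "A load/store per item": a fixed-layout entry of ten 8-byte slots
    // written into a preallocated ring-buffer slot.
    #include <cstdint>

    struct Entry {
        uint64_t fmt_id;     // index of the static format string
        uint64_t priority;
        uint64_t system_id;
        uint64_t args[7];    // the 7 payload values
    };

    inline void write_entry(Entry* slot, uint64_t fmt_id, uint64_t priority,
                            uint64_t system_id, const uint64_t (&args)[7]) {
        slot->fmt_id = fmt_id;       // 3 stores for the fixed fields...
        slot->priority = priority;
        slot->system_id = system_id;
        for (int i = 0; i < 7; ++i)  // ...plus 7 stores for the payload
            slot->args[i] = args[i];
    }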
Are there libraries like this, but in rust? As far as I understand, it relies on C's preprocessor, so it is impossible to create bindings for another language.
I have not used this and it says it targets embedded systems but maybe it is close enough to what you might be looking for: https://github.com/knurling-rs/defmt
I have used this, but not the library in the link. From the link’s README, they’re at least analogous. While maybe not the exact same thing, they’re at least the same idea.
Sounds like something that would be doable in rust. I'm not sure how you would go about building the compile-time table of log locations and format strings, sounds like the hardest part.
I believe the performance reported in their paper is circumstantial. It’s not that much faster when I tried it, and not worth the horrible macro syntax.
spdlog is designed as a general-purpose logging library and it can't beat low-latency loggers. It doesn't scale for multiple threads because its async mode uses a mutex and a condition variable to notify the background thread.
I was also thinking the same~
How come such a good idea is not already part of standard logging libraries -- allowing you to configure output to another process or a message queue!