I think the last sample needs a `fba.reset()` call in between requests.
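Something like this, with the reset deferred per iteration (just a sketch; `Server`, `accept`, and `handle` are stand-ins for whatever the real loop looks like):

```zig
fn worker(server: *Server) !void { // `Server` and `handle` are hypothetical
    var buf: [8 * 1024]u8 = undefined;
    var fba = std.heap.FixedBufferAllocator.init(&buf);

    while (server.accept()) |req| {
        defer fba.reset(); // reclaim the whole buffer between requests
        try handle(req, fba.allocator());
    }
}
```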
BTW, I've used Zig a lot recently and the opaque allocator system is great. You can create weird wrappers and stuff.
For example, the standard library JSON parser will parse JSON and deserialize it into a type that you request (say, a struct). But it needs to allocate stuff. So it creates an arena for that specific operation and returns a wrapper that has a `deinit` method. Calling it deinits the arena, so you essentially free everything in your graph of structs, arrays, etc. And since it receives an upstream allocator for the arena, you could pass in any allocator: a fixed-buffer allocator if you wish to use stack space, another arena, maybe a jemalloc wrapper, a test allocator that checks for memory leaks... whatever.
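A minimal sketch of that flow (0.11-era `std.json` API; the `User` type is made up):

```zig
const std = @import("std");

const User = struct { name: []const u8, age: u32 };

test "std.json arena wrapper" {
    const parsed = try std.json.parseFromSlice(
        User,
        std.testing.allocator, // upstream allocator for the internal arena
        "{\"name\": \"alice\", \"age\": 30}",
        .{},
    );
    defer parsed.deinit(); // one call frees the whole deserialized graph

    try std.testing.expectEqualStrings("alice", parsed.value.name);
}
```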
Unit tests are trivial because you can probably use a single arena that is only reset once at the end of the test. Unless the test is specifically to stress test memory in some form.
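E.g. something like this, routing every allocation in the test through one arena:

```zig
const std = @import("std");

test "one arena for the whole test" {
    var arena = std.heap.ArenaAllocator.init(std.testing.allocator);
    defer arena.deinit(); // a single teardown frees every allocation below
    const a = arena.allocator();

    var list = std.ArrayList(u32).init(a);
    try list.append(42); // no list.deinit() needed, the arena owns it
    try std.testing.expectEqual(@as(u32, 42), list.items[0]);
}
```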
> I assume the receiver then has to know it has to clone all of those values, yes?
The receiver needs to understand the lifetime any which way. If you parse a large JSON blob and wish to retain arbitrary key/values you have to understand how long they're valid for.
If you're using a garbage collection language you can not worry about it (you just have to worry about other things!). You can think about it less if the key/values are ref-counted. But for most C-like language implementations you probably have to retain either the entire parsed structure or clone the key/values you care about.
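In Zig terms, cloning the values you care about usually looks like this (sketch; `long_lived` is a stand-in allocator and `parsed` is the `std.json` wrapper from above):

```zig
// Clone only the values you want to keep, then drop the whole parsed graph.
const name = try long_lived.dupe(u8, parsed.value.name);
parsed.deinit(); // `name` survives independently of the parser's arena
```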
I assume the following is perfectly doable from a technical perspective, but is there any community support for using multiple allocators in this case: e.g. parsing general state into an arena, and the specific variables you want to use thereafter into a different allocator so they remain long-lived?
You can pass the json deserializer an allocator that is appropriate for the lifetime of the object you want to get out of it, so often no copying is required.
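For example, `std.json.parseFromSliceLeaky` skips the wrapper and allocates the result straight out of whatever allocator you hand it (sketch; `Config`, `bytes`, and `gpa` are stand-ins):

```zig
var arena = std.heap.ArenaAllocator.init(gpa);
// keep `arena` alive for as long as `config` is needed
const config = try std.json.parseFromSliceLeaky(Config, arena.allocator(), bytes, .{});
```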
I've mainly written unusual code that allocates a bunch of FixedBufferAllocators up front and clears each of them according to their own lifecycles. I agree that more typical code would reach for a GPA or something here. If you're using simdjzon, the tape and strings will be allocated contiguously within a pair of buffers (and then if you actually want to copy from the tape to your own struct containing slices or pointers then you'll have to decide where that goes), but the std json stuff will just repeatedly call whatever allocator you give it.
Well, see, the problem is that you’re going to have to keep the allocator alive for the entire lifetime you use any data whatsoever from the JSON. Or clone it manually. Both seem like they largely negate any benefit you get from using it?
It would prevent you from writing the bug at the top of the thread.
I have stopped considering this sort of thing as a potential addition to the language because the BDFL doesn't like it. So realistically we must remember to write reset, or defer deinit, etc. This sort of case hurts a little, but people who are used to RAII will experience more pain in cases where they want to return the value or store it somewhere and some other code gains responsibility for deinitializing it eventually.
On the other hand, it's clearer where things are released. When over-relying on destructors, it often becomes tricky to know when they run and in what order. This kind of trade-off is important to take into consideration depending on the project.
Maybe I'm too used to C++ and Rust but I don't find it tricky at all.
It's very clearly defined when a destructor (or `fn drop`) is called, as well as the order (the reverse of construction order).
What I would like to see would be some way of forcing users to manually call the dtor/drop.
So when I use a type like that I have to manually decide when it gets destroyed, and have actual compile checks that I do destroy the object in every code path.
I will admit that I really miss the "value semantics" thing that RAII gives you when working in Zig. A good example is collections, like hash tables or whatever: ownership is super clear in C++/Rust. When the hash table goes away, all the contained values go away, because the table owns its contents. When you assign a key/value, you don't have to consider what happens to the old key/value (if there was one), RAII just takes care of it. A type that manages a resource has value semantics just like "primitive" types, you don't have to worry.
Not so in Zig: whenever you deal with collections where the keys or values manage a resource (i.e. do not have value semantics), you have to be incredibly careful, because you have to consider the lifetimes of the HashMap and of the keys/values separately. A function like HashMap.put is sort of terrifying for this reason: very easy to create a memory leak.
I get why Zig does it though, and I don't think adding C++/Rust style value semantics into the language is a good idea. But it certainly sometimes makes it more challenging to work in.
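To make the `put` footgun concrete, a sketch (assuming `gpa` is a `std.mem.Allocator` handle):

```zig
var map = std.StringHashMap([]u8).init(gpa);
defer map.deinit(); // frees the table itself, *not* the values

try map.put("k", try gpa.dupe(u8, "old"));
try map.put("k", try gpa.dupe(u8, "new")); // silently leaks "old"

// One fix: fetchPut hands back the displaced entry so you can free it.
if (try map.fetchPut("k", try gpa.dupe(u8, "newer"))) |prev| {
    gpa.free(prev.value);
}
// (a real program would also iterate and free remaining values before deinit)
```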
Among systems languages, I've mostly used C and Zig. I don't think dtor order is tricky so much as I think that defaulting to fine-grained automatic resource management including running a lot of destructors causes programs that use this default to pay large costs at runtime :(
I think the latter problem is impossible, you end up needing RAII or a type-level lifetime system that can't express a lot of correct programs. I would like something that prevents "accidentally didn't deinit" in simple cases, but it probably wouldn't prevent "accidentally didn't call reset on your FixedBufferAllocator" because the tooling doesn't know about your request lifecycle.
You want a code analyzer on top of a language like zig. I think people think it's hard because it would really be hard for C. Probably would be MUCH easier in zig.
Because you can have an arbitrary number of objects that can all be freed in O(1), instead of traversing a tree and calling individual destructors. An arena per object makes no sense.
I'm not 100% sure how Zig allocators work but it looks like the arena memory is getting re-used without zeroing the memory? With slight memory corruption freed memory from a previous request can end up leaking. That's not great.
Even if you don't have process isolation between workers (which is generally what you want) then you can still put memory arenas far apart in virtual memory, make use of inaccessible guard pages, and take other precautions to prevent catastrophic memory corruption.
I guess you could place a zeroing allocator wrapper between the arena and its underlying allocator. It would write zeroes over anything getting freed. Arena deinit frees everything it got from the underlying allocator, so upon completion of each request, used memory would be zeroed before being returned to the main allocator.
And that handler signature would still be the same. Which is the whole point of this article, so: yay.
I once spent an utterly baffling afternoon trying to figure out why my benchmark for a reverse iteration across a rope data structure in Julia was finishing way too fast. I was perf tuning it, and while it would have been lovely if my implementation was actually 50 times faster than reverse iterating a native String type, I didn't buy it.
Finally figured it out: I flipped a sign in the reverse iterator, so it was allocating a bunch of memory and immediately hitting the margin of the Vector, and returning it with most of the bytes undefined. Why didn't I catch it sooner? Well, I kept running the benchmark, which allocated a reverse buffer for the String version, which GC released, then I ran the buggy code... and the GC picked up the recently freed correct data and handed it back to me! Oops.
Of course, if you want to avoid that risk in Zig, you just write a ZeroOnFreeAllocator, which zeros out your memory when you free it. It's a drop in replacement for anything which needs an allocator, job done.
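A sketch of such a wrapper, written against the 0.13-era `std.mem.Allocator` vtable (the exact signatures shift between Zig versions, so treat this as illustrative):

```zig
const std = @import("std");

const ZeroOnFreeAllocator = struct {
    child: std.mem.Allocator,

    const vtable = std.mem.Allocator.VTable{
        .alloc = alloc,
        .resize = resize,
        .free = free,
    };

    pub fn allocator(self: *ZeroOnFreeAllocator) std.mem.Allocator {
        return .{ .ptr = self, .vtable = &vtable };
    }

    fn alloc(ctx: *anyopaque, len: usize, ptr_align: u8, ret_addr: usize) ?[*]u8 {
        const self: *ZeroOnFreeAllocator = @ptrCast(@alignCast(ctx));
        return self.child.rawAlloc(len, ptr_align, ret_addr);
    }

    fn resize(ctx: *anyopaque, buf: []u8, buf_align: u8, new_len: usize, ret_addr: usize) bool {
        const self: *ZeroOnFreeAllocator = @ptrCast(@alignCast(ctx));
        // (a production version would also scrub the tail of a shrinking resize)
        return self.child.rawResize(buf, buf_align, new_len, ret_addr);
    }

    fn free(ctx: *anyopaque, buf: []u8, buf_align: u8, ret_addr: usize) void {
        const self: *ZeroOnFreeAllocator = @ptrCast(@alignCast(ctx));
        @memset(buf, 0); // scrub before handing the memory back
        self.child.rawFree(buf, buf_align, ret_addr);
    }
};
```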
In my Zig servers I'm using a similar arena-based (with resetting) strategy. It's not as bad as you'd imagine:
The current alloc implementation memsets under the hood. There are ongoing discussions about the right way to remove that performance overhead, but safety comes first.
Any sane implementation has an arena per request and per connection anyway, not shared between processes. You don't have bonkers aliasing bugs because the OS would have panicked before handing out that memory.
Zig has a lot of small features designed to make memory corruption an unhappy code path. I've had one corruption bug out of a lot of Zig code the last few years. It was from a misunderstanding of async (a classic stack pointer leak disguised by a confusion over async syntax). It's not an issue since async is gone from the language, and that sort of thing is normally turned into a compiler error anyway as soon as somebody reports it.
memset is the golden example of an easily pipelined, parallelized, predictable CPU operation - any semi-modern CPU couldn't ask for easier work to do. Zeroing 8 KB of memory is very cheap.
If we use a modern Xeon chip as an example, an AVX2 store has a throughput of 2 instructions / cycle. Doing that 256 times for 8 KB totals 128 cycles, plus a few extra cycles to account for the latency of issuing the first instruction and the last store to the L1 cache. With a 2 GHz clock frequency, it still takes less than 70 nanoseconds. For comparison, an integer divide has a worst-case latency of 90ish cycles, or 45ish nanoseconds.
Zeroing memory is very cheap, but not zeroing it is even cheaper.
Zeroing memory on deallocation can be important for sensitive data. Otherwise, it makes more sense to zero on allocation, if you know it's needed because the allocated structure will be used without initialization and the memory isn't guaranteed to be zero (most OSes guarantee newly allocated pages are zero, and have a process to zero pages in the background when possible).
Sure, but in most practical applications where an HTTP server is involved, zeroing the request/response buffer memory is very unlikely to ever be your bottleneck. Even at 10K RPS per core, your per-request CPU time budget is 100 microseconds. Zeroing memory will only account for a fraction of a percentage of that.
If you're exposing an HTTP API to clients, it's likely that any response's contents will contain sensitive client-specific data. If memory corruption bugs are more likely than bottlenecking on zeroing out your request/response buffer, then zeroing the request/response buffer is a good idea, until proven otherwise by benchmarks or profiling.
Zeroing on allocation is much more sensible though, because that way you preload the memory into your caches, as opposed to on deallocation, where you bring memory into cache that you know you no longer care about. Also, if you zero on allocation, the compiler can delete the zeroing if it can prove that you write to the memory before reading from it.
This memory is now the most recently used in the L1 cache, despite being freed by the allocator, meaning it probably isn't going to be used again.
If it was freed after already being removed from the L1 cache, then you also need to evict other L1 cache contents and wait for it to be read into L1 so you can write to it.
128 cycles is a generous estimate, and ignores the costs to the rest of the program.
Nontemporal writes are substantially slower, e.g. with avx512 you can do 1 64 byte nontemporal write every 5 or so clock cycles. That puts you at >= 640 cycles for 8 KiB.
https://uops.info/html-instr/VMOVNTPS_M512_ZMM.html
Well, the point of a non-temporal write kind of is that you don't care how fast it is. (Since if it was being read again anytime soon, you'd want it in the cache.)
The worker is already reading/writing to the buffer memory to service each incoming HTTP request, whether the memory is zeroed or not. The side effects on the CPU cache are insubstantial.
This might be a stupid question, but why isn't zeroing 8KB of memory a single instruction? It must be common enough to be worth teaching all the layers of memory (and indirection) to understand it.
For 8kb? Syscalling into the kernel, updating the process's memory map and then later faulting is probably slower by an order of magnitude or more compared to just setting those bytes to zero.
Memcpy, bzero and friends are insanely fast. Practically free when those bytes are in the cpu’s cache already.
Probably still cause a page fault when the memory is re-accessed though. I suspect even using io_uring will still be a lot slower than bzero if you're just zeroing out 2 pages of memory. Zeroing memory is really fast.
128-bit or 256-bit memsets via SIMD instructions are sufficient to saturate RAM bandwidth, so there wouldn't be much of a gain from having a dedicated instruction.
(By the way, x86 does have a dedicated instruction--rep stosb--but compilers differ as to how often they use it, for the reason cited above.)
Compilers can remove the memset if they can show it is overwritten prior to use (though C and C++ UB could technically let them skip the padding bytes, they don't), or if it isn't used at all (in which case we're back to non-zeroed memory, which in this scenario is exactly what we're trying to avoid).
There are various _s variants of memset, etc that require the compiler to perform the operations even if it “proves” the data cannot be read.
And finally modern hardware has mechanisms to say “this is now zero” and not actually zero the memory and instead just tell the MMU that the region is now zero (which removes the cpu time and cache impact of accessing the memory directly).
On macOS and iOS I believe all memory is now zero’d on free and I think malloc ostensibly therefore guarantees zero’d memory (the problem I think is whether calloc tries to rely on that behavior, because then calloc can produce non-zero memory courtesy of a buffer overrun/UaF after free has ostensibly zero’d memory)
I don't have anything for you, but if you have some normally allocated hierarchical data structures, then in order to free them you'll have to go through their members, chase pointers, etc., to figure out the addresses to free, then call free on them in sequence. That's all going to be a lot more expensive than just memsetting a bunch of data to zero, which you can do at whatever your core's memory bandwidth is.
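For contrast, the arena version of freeing a pointer-heavy structure (sketch; the `Node` type is made up):

```zig
const std = @import("std");

const Node = struct {
    children: []*Node,
    payload: []u8,
};

fn buildAndDrop() !void {
    var arena = std.heap.ArenaAllocator.init(std.heap.page_allocator);
    defer arena.deinit(); // one call frees every node and payload, no traversal
    const a = arena.allocator();

    // ... build an arbitrarily deep tree of Nodes with `a` ...
    // Without the arena you'd recursively visit children, free payloads,
    // then free each node, chasing cold pointers the whole way.
    _ = a;
}
```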
Yep. And you often don’t even need to zero the data.
Generally, no paper or SO answer will tell you where your program spends its time. Learn to use profiling tools, and experiment with stuff like this. Try out arenas. Benchmark before and after and see what kind of real performance difference it makes in your own program.
Generally, you zero on free in secure environments to avoid leaking secrets from one context to the next, i.e. a request may contain a password, which the next request should not have access to.
Depends where you draw the line. An arena allocator per request needs to be managed at least by an app framework, if not the application. It's all layers of abstraction, and one of those layers needs to 0 memory.
The arena allocator implementation for general use absolutely should not do the zeroing work. This is a specific use case, which can be implemented in an app-specific custom allocator.
That's not what I said. My point was that an arena allocator has to be managed at a relatively high level. Similarly, an allocator responsible for 0 on free would be managed at a similar level. They are orthogonal concepts as you say, but there's no reason 0 on free can't be managed by an allocator.
The inspiration for the post came from my httpz library, where a fallback combination of a FixedBufferAllocator + ArenaAllocator is used. The fixed buffer is a thread-local. But the arena allocators belong to connections, of which there could be thousands.
You might have 1 fixed buffer for N (500+) ArenaAllocators (though only one is in use at a time). This allows you to allocate a relatively large fixed buffer, since you have relatively few threads.
If you just used retain_with_limit, then you'd either have to have a much smaller retained size, or you'd need a lot more memory.
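For reference, the per-request reset in that kind of setup looks roughly like this (sketch; the limit is arbitrary, but `retain_with_limit` is the real `ArenaAllocator.ResetMode` variant):

```zig
// After each request: keep up to 64 KiB of arena capacity for the next
// request, and return anything beyond that to the underlying allocator.
_ = arena.reset(.{ .retain_with_limit = 64 * 1024 });
```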
I have to admit I don’t really understand the problem Zig is trying to solve. If you’re not trying at the language level to address the core problems of C/C++, like Rust is, then it seems like you’re just making a more ergonomic version of those languages and that’s not enough to overcome how deeply entrenched they are.
Disclaimer: I don't really write anything in any of the four languages mentioned, so this is just my impression from the outside looking in, but I'm sure that for many people the ergonomics are one of the core problems of C/C++.
Matklad also had an interesting take in one of their blogs: if you have to write unsafe code, then Zig is actually both a more ergonomic language than Rust, and easier to write "safe" unsafe code in.
Zig's tight C integration makes it easy to just start using Zig in an existing C codebase, which is a great way to overcome the challenges you've mentioned. It doesn't need to replace C all at once, just slowly.
Sure but is that really enough to get buy in on a whole new language? Especially when the new language still leaves open the door for so many of the critical problems with C?
Dangling references are about the only C problem that Zig doesn't fix with language/compiler features; everything else is taken care of, and mostly in a quite elegant and minimalistic way. Also, Zig doesn't need to replace C to be successful, just augment it, and for that it's already a really good choice.
Also, Rust doesn't really have a solution here. There are safe ways to concurrently access the same memory region that Rust disallows without unsafe, and if you are writing a high-performance language runtime you might well need these features.
A couple years ago I evaluated Rust for a low-latency application that does math and has a websocket and http client. I didn't use Rust because basically every library for speaking these protocols thought that it was fine to call the global allocator, or run UTF-8 validation on the returned buffers, or pull in a huge async runtime, etc.
That's great but if you're getting paid to write that code you're not getting paid to enjoy it. You're getting paid to write code that doesn't have bugs and security holes.
Our profession's tendency to take the easy way out on tech choices is eventually going to lead to the kind of regulation and certification other engineers are subject to.
Basically it is Modula-2 with a C facelift, and compile time metaprogramming, for those that want Safe C.
Which means, while the safety is much better than raw C, with proper strings and arrays alongside a stronger type system, it still has UAF as a possible gotcha.
Not me, not yet, and it's been a few years since I've used Blaze.
It ought to be fairly straightforward. Zig is an easy dependency to either vendor or install on a given system/code-base (much more painful currently if you want Blaze to also build Zig itself), and at a bare minimum you could just add BUILD steps for each of the artifacts defined in build.zig.
Things get more interesting if you want to take advantage of Zig's caching, especially once incremental compilation is fully released. It's a fast enough compilation step that perhaps you could totally ignore Zig's caching for now and wait to see how that feature shapes up before making any hard decisions, but my spidey senses say that'll be a nontrivial amount of work for _somebody_ to integrate those two ideas.
I happen to agree. When I see 'fn' I "hear" function, when I see 'func' I hear "funk, but misspelled".
Also, with four space indentation (or for Go, a four space tabset), 'func' aligns right where the code begins, pushing the function name off one space to the right. For 'fn' the function name starts one space before the code, I find this more aesthetic. Then again, the standard tabset is eight spaces, so this matters less in Go.
It would be pretty silly to pick a language on the basis of that kind of superficial window dressing, of course. But I know which one I prefer.
You don't understand personal preferences? Or you don't understand the desire to share them with your peers? Or you can't understand why people don't just bully themselves into silence for the benefit of others?
I literally just signed up to ask if anybody can recommend any good Zig codebases to read other than TigerBeetle. How's your terminal going?
Edit: The rest of the posted site seems like a treasure trove not just this one article. Was wondering how to get into Zig and here we are. Such kismet.
If you're looking for interesting Zig codebases to read, you might be interested in our low-level audio input/output library[1] or our module system[2] codebase - the latter includes an entity component system and uses Zig's comptime to a great degree to enable some interesting flexibility (dependency injection, global view of the world, etc.) while maintaining a great amount of type safety in an otherwise dynamic system.
I agree that the lack of control is frustrating, but consider: how much software is actually going to do anything useful if allocation is failing? Designing your std library around the common case and then gathering input on what memory-fallible APIs should look like is smarter IMO.
Most problems have speed/memory/disk tradeoffs available. Simple coding strategies include "if RAM then do the fast thing, else do the slightly slower thing", "if RAM then allocate that way, else use mmap", "if RAM then continue, else notify operator without throwing away all their work", ....
Rust was still probably right not to expose that at first, since memory is supposed to be fairly transparent, but Zig forces the user to care about memory, and given that constraint it's nearly free to also inform them of problems. The stdlib is already designed (like in Rust) around allocations succeeding, since those errors are just passed to the caller, but Zig can immediately start collecting data about how people use those capabilities. At a language level, including visibility into allocation failures was IMO a good idea.
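Concretely, an allocation failure in Zig is just an error value the caller can branch on (sketch; `huge_n` and `fallbackPath` are made up):

```zig
// error.OutOfMemory is part of the allocator's error set like any other error.
const buf = allocator.alloc(u8, huge_n) catch |err| switch (err) {
    error.OutOfMemory => return fallbackPath(), // hypothetical slower path
};
```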
andrewrk sees this as a failure but doesn't yet have a solution. It seems difficult to provide users with features they want, such as recursion and C interop, while also statically proving the stack is sufficiently small.
In programs that don't have stack overflows, it's nice to be able to handle allocation failures though :)
The Rust standard library aborts on allocation failure using the basic APIs, but Rust itself doesn't allocate. If someone wanted to write a Zig-style library in Rust, it would work just fine.
These kinds of tactics work for simple examples. In real-world HTTP servers you'll retain memory across requests (caches) and you'll need a way to handle blocking IO. That's why we most commonly use GC'd/ownership languages for this, plus things like goroutines/tokio/etc.: web devs don't want to deal with memory themselves.
It scales to complex examples as well. Retained memory would be handled with its own allocator: for a large data structure like an LRU cache, one would initialize it with a pointer to the allocator, and use that internally to manage the memory.
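In Zig that usually looks like the structure storing its allocator at init (sketch; `LruCache` is a stand-in, not a std type):

```zig
const LruCache = struct {
    allocator: std.mem.Allocator, // long-lived allocator, e.g. a GPA
    map: std.StringHashMap([]u8),

    pub fn init(allocator: std.mem.Allocator) LruCache {
        return .{
            .allocator = allocator,
            .map = std.StringHashMap([]u8).init(allocator),
        };
    }

    // put/get/evict would allocate and free through self.allocator,
    // entirely independent of any per-request arena.
};
```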
Blocking (or rather, non-blocking, which is clearly what you meant) IO is a different story. Zig had an async system, but it had problems and got removed a couple point releases ago. There's libxev[0] for evented programs, from Mitchell Hashimoto. It's not mature yet but it offers a good solution to single-threaded concurrency and non-blocking IO.
I don't think Zig is the best choice for multithreaded programs, however, unless they're carefully engineered to share little to no memory (using message passing, for instance). You'd have to take care of locking and atomic ops manually, and unlike memory bugs, Zig doesn't have a lot of built-in support for catching problems with that.
A language with manual memory allocation isn't going to be the language of choice for writing web servers, for pretty obvious reasons. But for an application like squeezing the best performance out of a resource-constrained environment, the tradeoffs start to make sense.
Off the top of my head, I was wondering... for software like web services, isn't it easier and faster to use a bump allocator per request, and release the whole block at the end of it? Assuming the number of concurrent requests/memory usage is known and you don't expect any massive spike.
I am working on an actor language kernel, and was thinking of adopting the same strategy, i.e. using a very naive bump allocator per actor, with the idea that many actors die pretty quickly so you don't have to pay for the cost of GC most of the time. You can run the GC after a certain threshold of memory usage.
The problem _somebody_ between the hardware and your webapp has to deal with is fragmentation, and it's especially annoying with requests which don't evenly consume RAM. Your OS can map pages around that problem, but it's cheaper to have a contiguous right-sized allocation which you never re-initialize.
Assuming the number of concurrent requests is known and they have bounded memory usage (the latter is application-dependent, the former can be emulated by 503-erroring excess requests, or something trickier if clients handle that poorly), yeah, just splay a bunch of bump allocators evenly throughout RAM and don't worry about the details. It's not much faster though. The steady state for reset arenas is that they're all right-sized contiguous bump allocators. Using that strategy, arenas are a negligible contribution to the costs of a 200k QPS/core service.
If you never cache any data, sure, you can use a bump allocator. Otherwise it gets tricky. I haven't worked with actors really, but from the looks of it, it seems like they would create a lot of bottlenecks compared to coroutines, and that would probably throw all your bump-allocator performance benefits out the window. As for the GC thing: you can't 'just' call a GC. Either you use a bump allocator or you use a GC. Your GC can't steal objects from your bump allocator. It can copy them... but then the reference changes, and that's a big problem.
I think this comment assumes that you're using one allocator, but it's probably normal in Zig to use one allocator for your caches, and another allocator for your per-request state, with one instance of the latter sort of allocator for each execution context that handles requests (probably coroutines). So you can just have both, and the stuff that can go in the bump allocator does, and concurrent requests don't step on each others toes.
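A sketch of that two-allocator split (names like `Cache`, `nextRequest`, and `handle` are stand-ins):

```zig
fn serve() !void {
    var gpa = std.heap.GeneralPurposeAllocator(.{}){};
    var cache = Cache.init(gpa.allocator()); // long-lived, shared across requests

    var request_arena = std.heap.ArenaAllocator.init(std.heap.page_allocator);
    defer request_arena.deinit();

    while (nextRequest()) |req| {
        // per-request scratch; reset keeps the capacity warm for the next one
        defer _ = request_arena.reset(.retain_capacity);
        try handle(req, &cache, request_arena.allocator());
    }
}
```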
Have you looked at how Erlang does memory management within its processes? You definitely can "get away" with a lot of things when you have actors you can reasonably expect will be small scale, if you are absolutely sure their data dies with them.
The trick to Erlang's memory management is that data is immutable and never shared, so all the complication and contention around GC and atomic locks just disappear.
The key thing (as I understand it) is that each process naturally has a relatively small private set of data, so Erlang can use a stop-the-process semispace copying collection strategy and it's fast enough to work out fine.
Since nothing can be writing to it during that anyway, I'm not sure the language level immutability makes a lot of difference to GC itself.
This example came from a real world http server. Admittedly, Zig's "web dev" community is small, but we're trying :) I'm sure a lot could be improved in httpz, but it's filling a gap.
You can use these patterns for per-request resources that persist across some I/O calls using async if you are on an old version of Zig or using zigcoro while you wait for the feature to return to the language. zigcoro's API is designed to make the eventual transition back to language-level async easy.