Fast, simple, hard real time allocator for Rust (github.com/pcwalton)
207 points by practicalrs 6 months ago | 58 comments



I wrote almost the exact same thing after seeing Sebastian Aaltonen's original offset allocator repo which inspired me to write my own clone in Rust.

It's possible to further improve the fragmentation characteristics if the maximum size of the memory to be allocated is known ahead of time. The original used a 5.3 "floating point" scheme for the free bins, but given the maximum size you can opt for 4.4 or 6.2 or whatever can cover the whole region, giving more dense (or sparse) allocation granularity as needed.
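
To make the bin scheme concrete: the size-to-bin mapping is essentially a tiny software float encoder. A rough Rust sketch of the 5.3 variant (rounding details are from memory rather than copied from either repo; changing MANTISSA_BITS gives the 4.4 / 6.2 style variants):

  // Map a size to a bin index using an exponent/mantissa ("small float") split.
  const MANTISSA_BITS: u32 = 3;                  // the ".3" in the 5.3 scheme
  const MANTISSA_VALUE: u32 = 1 << MANTISSA_BITS;
  const MANTISSA_MASK: u32 = MANTISSA_VALUE - 1;

  // Round `size` up to the next representable bin and return the bin index.
  fn size_to_bin_round_up(size: u32) -> u32 {
      if size < MANTISSA_VALUE {
          return size; // tiny sizes map 1:1, like denormals
      }
      let highest_set_bit = 31 - size.leading_zeros();
      let mantissa_start = highest_set_bit - MANTISSA_BITS;
      let exp = mantissa_start + 1;
      let mut mantissa = (size >> mantissa_start) & MANTISSA_MASK;
      if size & ((1u32 << mantissa_start) - 1) != 0 {
          mantissa += 1; // round up; an overflow here carries into the exponent below
      }
      (exp << MANTISSA_BITS) + mantissa
  }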

This same idea can also be extended to allocating texture atlas regions when the regions have power of two size. 256 bins can cover all possible texture sizes that GPUs can handle, so you can get hard O(1) texture packing.
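
Concretely, I mean one bin per (log2 width, log2 height) pair: with exponents 0..=15 that's 16 x 16 = 256 bins, and bin selection is just two bit scans. The exact layout below is illustrative, not code from any repo:

  // One bin per power-of-two (width, height) pair; assumes both are powers of two.
  fn texture_bin(width: u32, height: u32) -> u32 {
      debug_assert!(width.is_power_of_two() && height.is_power_of_two());
      width.trailing_zeros() * 16 + height.trailing_zeros()
  }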

I'm also curious about making something like this work as a general purpose allocator (as alluded to by the README). For that it would be necessary to make a reverse lookup from pointer address to allocation handle, which would require something like a red-black tree covering the address space, which would no longer be O(1). If anyone has ideas on this front, I would be happy to hear them.

This algorithm is so clever and kind of a game changer. Allocating and freeing are fast real time operations (only dozens of CPU cycles and a few memory writes). It kind of makes "bump allocators" obsolete because this allocator is almost as fast but supports trimming and freeing allocations as well. This algorithm requires more memory but the amount is quite low.


> For that it would be necessary to make a reverse lookup from pointer address to allocation handle, which would require something like a red-black tree covering the address space, which would no longer be O(1). If anyone has ideas on this front, I would be happy to hear them.

A radix tree can solve this: https://en.wikipedia.org/wiki/Radix_tree

I used one way, way back to do exactly the same thing: upon a free, I needed to look up all of the metadata for an address. For a 32-bit address space, you can just allocate a giant array up front, and use the address as an index. For a 64-bit address space, it's obviously way too big to statically allocate it up front. A radix tree neatly solves the problem. An arbitrarily sized radix tree is not constant lookup, but for reasons I honestly cannot remember, for a 64-bit address space, it's guaranteed to only be 3 levels deep.

See my implementation, which (I believe) I borrowed from tcmalloc: https://github.com/scotts/streamflow/blob/master/streamflow....
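
For anyone who doesn't want to read the C, here's a rough Rust sketch of the same idea (lookup only; levels get populated lazily on insert). The fixed depth presumably falls out of the address-space math: with 4 KiB pages and 48 usable virtual-address bits, the 36-bit page number splits evenly into three 12-bit indices. Names and types here are illustrative, not taken from streamflow or tcmalloc:

  const PAGE_SHIFT: u64 = 12;  // 4 KiB pages
  const LEVEL_BITS: u64 = 12;  // 3 x 12 bits covers a 36-bit page number
  const LEVEL_MASK: u64 = (1 << LEVEL_BITS) - 1;

  type Leaf<M> = Vec<Option<M>>;            // 4096 metadata slots
  type Mid<M> = Vec<Option<Box<Leaf<M>>>>;  // 4096 leaf pointers
  struct PageMap<M> {
      root: Vec<Option<Box<Mid<M>>>>,       // 4096 mid-level pointers
  }

  impl<M> PageMap<M> {
      // Exactly three array indexings, independent of how many objects are tracked.
      fn lookup(&self, addr: u64) -> Option<&M> {
          let page = addr >> PAGE_SHIFT;
          let i1 = ((page >> (2 * LEVEL_BITS)) & LEVEL_MASK) as usize;
          let i2 = ((page >> LEVEL_BITS) & LEVEL_MASK) as usize;
          let i3 = (page & LEVEL_MASK) as usize;
          let mid = self.root[i1].as_ref()?;
          let leaf = mid[i2].as_ref()?;
          leaf[i3].as_ref()
      }
  }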


Radix trees are not O(1), they're O(log n). Data structures that support constant-time predecessor lookup do not exist; there is a super-constant lower bound even in the RAM model [1].

Often I hear people say "well pointer size is constant, so log(64) = O(1)", but that is misleading: if you assume constant pointer size, any function of that is O(1) as well, so any bounded algorithm is O(1). The notation just becomes meaningless.

Asymptotic bounds must be presented and understood in context.

[1] https://en.wikipedia.org/wiki/Predecessor_problem#Mathematic...


This function is O(1): https://github.com/scotts/streamflow/blob/master/streamflow..... All other tree operations are also constant, because the fact that it is a 3-level tree is hardcoded.

Asymptotic bounds are useful when your input can grow arbitrarily large. When your input is fixed, the bounds become less useful and you should focus on the particulars of your case. In the case I'm presenting, the work done traversing the tree will be the same every time; it is O(1). That doesn't necessarily mean it's fast enough! It still may do too much work for the use case. For instance, I can imagine someone saying that the function above does too much pointer chasing for their use case.


> This function is O(1)

I think I addressed that in my comment, but to be more explicit, this function is O(1) too:

  size_t find_sorted(void* object, void** list, size_t size) {
    for (size_t i = 0; i < SIZE_MAX; ++i) {
      if (i < size && list[i] > object) return i;
    }
    return SIZE_MAX;
  }
If O(1) cannot distinguish your function from this function, what is its informational value?

> Asymptotic bounds are useful when your input can grow arbitrarily large

But your inputs can't grow arbitrarily large, that's why you can hardcode 3 levels. O(1) is an asymptotic bound, and my point is that it is not very informative here.


I'm really not sure what your overall point here is. Yes, I agree with you. Your function is O(2^64), which is technically O(1). Your point, which I agree with, is that's completely useless information. It does not help our understanding of performance at all. Calling such a function O(1) is technically true, but both misleading and not informative. What I'm not clear on is how that relates to the discussion we're having here.

The original poster said they wanted a O(1) solution to a problem. I presented one. That solution happens to be based on a data structure whose algorithms are, in the general case, O(log n). But we're not dealing with a general case, we're dealing with a specific case. And because of that specific case, we can write algorithms that are O(1). Unlike your example, these algorithms have a very small n; 3, to be exact. That is meaningful to describe as O(1) in this case because we can reduce the work down to a small constant.


My overall point is that neither your function nor my function is actually O(1). Whenever you see the notation O(...), there is an implicit context: "As input size n grows arbitrarily, ...". You can check the formal definition on Wikipedia.
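
For reference, the definition being invoked (the quantifier over arbitrarily large n is the whole point):

  f(n) = O(g(n))  iff  there exist C > 0 and n0 such that
                       f(n) <= C * g(n) for all n >= n0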

The cost function for both our functions is not defined for arbitrary n, because they both stop working when input size crosses a threshold. So the O(1) notation is not well-defined in this case.

Now you could come up with a different formal definition of O(1) for bounded input sizes, which is fine, but I don't think you can find one that makes your function O(1) and my function non-O(1). So it would not be a meaningful definition in this case.

Ultimately, you're using O(1) colloquially. In your words, calling my function O(1) is misleading while it is fine for yours because the constant is "small". "Small" is a subjective term, while O(1) is a formal term.

If your definition hinges on a subjective characterization, why not just say "it's fast", instead of incorrectly using a technical term?

(If we really want to be pedantic, there is no such thing as "constant-time" when accessing memory anyway: a TLB miss, for example, will make the CPU traverse a tree; a page fault can execute arbitrary code.)


Ah, in my context, N is the number of live allocated objects that the memory allocator knows about. If you use a data structure like a red-black tree to track the metadata, the work you do traversing and maintaining the tree will grow log N with the number of live allocated objects you're tracking. The radix tree specialization I presented is constant with respect to the number of live allocated objects.


This is O(n) because you're still doing the i < size comparison, even though you've moved it out of the for loop.


For almost all n (size), the function runs for SIZE_MAX steps, since almost all numbers are greater than SIZE_MAX. And it never runs for more than SIZE_MAX steps.


Given that pointer size is fixed, maybe asymptotic bounds aren’t what we should care about? Maybe it would be better to ask how the cost increases just going up to 2^64 bytes of address space. The graph after that point is irrelevant.

An O(n^2) algorithm will blow up well before that.


> Often I hear people say "well pointer size is constant, so log(64) = O(1)", but that is misleading: if you assume constant pointer size, any function of that is O(1) as well, so any bounded algorithm is O(1). The notation just becomes meaningless.

It's not meaningless. We're designing for a real machine here, and the actual goal is to bound the runtime to a small known constant.

You're right that big O is not great notation for this. Hell, even O(1) isn't good enough because the constant factor is unspecified.

But the misuse of notation does not invalidate the problem. Taking the worst case for a log leaves you with a feasible runtime. Taking the worst case for n or n^2 or "any bounded algorithm" overwhelmingly does not.

> Asymptotic bounds must be presented and understood in context.

Yes, context is exactly what keeps it meaningful!


I did this too! Except it was based on the original TLSF paper (on which this work is based) as I wasn't aware of that offset allocator.

> I'm also curious about making something like this work as a general purpose allocator (as alluded to by the README).

In a general purpose allocator, you can just store allocation metadata next to a payload so the lookup from a pointer address becomes O(1). The original TLSF algorithm (as proposed by the paper) was designed this way.

Months later, I also wrote a general purpose implementation of TLSF for Rust `GlobalAlloc`, which, with some rigorous optimizations, turned out to be faster and smaller than every allocator for WebAssembly and embedded environments I tested, and it's still my go-to choice. It didn't fare too badly even when compared against modern thread-caching allocators in normal use cases. Internal fragmentation is probably worse, though.


> In a general purpose allocator, you can just store allocation metadata next to a payload so the lookup from a pointer address becomes O(1).

Yes, but the downside is that now allocator metadata pollutes the cache. It's super efficient for the allocator, but it may harm efficiency of the actual use of that memory. I believe most production allocators don't use object headers for this reason.


> I believe most production allocators don't use object headers for this reason.

Isn’t the original reason hardening against buffer overflows? Overwriting allocator metadata can be used to attack a system.


Yes. I believe that the standard workaround for this problem is to place the allocator metadata for an entire block at some fixed address boundary, so you can just mask off the bottom K bits of an address to find the allocator metadata for the block, and from there find the metadata of the allocation you're concerned with.
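
A minimal sketch of that trick, assuming 2 MiB aligned superblocks with the header at the start of each block (the constants and header layout are made up for illustration):

  const BLOCK_SIZE: usize = 2 * 1024 * 1024;   // must match the block alignment
  const BLOCK_MASK: usize = !(BLOCK_SIZE - 1);

  #[repr(C)]
  struct BlockHeader {
      size_class: u32,   // per-block allocator state lives here: free list, owner, ...
  }

  // Recover the block header from any pointer inside the block by masking.
  unsafe fn block_header_for(ptr: *const u8) -> *const BlockHeader {
      ((ptr as usize) & BLOCK_MASK) as *const BlockHeader
  }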


Have you published your code to GH, by any chance?


Offset allocator: https://github.com/yvt/xalloc-rs/blob/master/src/tlsf.rs Don't mind the mess, it's one of my earliest works in Rust.

General-purpose allocator: https://github.com/yvt/rlsf


> I'm also curious about making something like this work as a general purpose allocator (as alluded to by the README).

Besides the reverse mapping, you'd also have to get rid of the Vec use in the allocator, but that's fairly straightforward. Additionally, you'd need some sort of logic about requesting slabs of memory from the OS (and returning them when free), which is some extra work.


Author here. Didn't expect this to make HN :)

I don't deserve any credit for this; @SebAaltonen did all the work to write the C++ version. I just did a line-by-line translation.


What's the difference between this and a slab allocator? Is it just that the bin size distribution wastes less memory (assuming the slab allocator used pow2 bin sizes)?


Slab allocators don’t provide real time guarantees. That’s what arena allocators are for. The distinction may seem trivial but the requirements are distinct.

In tiny systems all allocations of one type may come from a single operation, but in larger systems what fits in a slab will come from distinct concerns with different priorities. You want a different arena for those allocations.


I'm confused. To me it seems that real time guarantees are a function of three things:

- Is it constant time in algorithmic complexity?

- Is it single threaded (i.e. free from locks)?

- Does it need to fetch memory from the system (not realtime safe)?

There are slab allocators that can satisfy all of these, and are therefore realtime safe.

So I'm still wondering what the difference is.


I have a best-fit allocator for this problem (managing GPU resources). It has higher CPU overhead due to using a std::map, but should waste less space due to not using fixed bucket sizes.

https://github.com/glaretechnologies/glare-core/blob/master/... https://github.com/glaretechnologies/glare-core/blob/master/...


> Please note that offset-allocator isn't a Rust allocator conforming to the GlobalAlloc trait

Aw. Why not?

> find the next available bin using 2x LZCNT instructions to make all operations O(1)

Isn't that sort of implementation defined? The LZCNT operation itself is a loop over all bits -- the fact that the number of bits is constant doesn't make this O(1), it just makes it O(n) where n := 64. But if it was 16 or 32 bit, it may be faster, which means it's not O(1), or rather, big-O notation really breaks down here.


> It just makes it O(n) where n := 64

That's O(1).

> The LZCNT operation itself is a loop over all bits

That's not how circuits work. You can easily make a circuit of O(log n) depth that returns the base-2 encoded index of the first 1 bit in an n-bit register.

But since n is just 64 here, you're looking at a circuit of AND/OR depth ~8, so it's no surprise that modern CPUs can compute the LZCNT / TZCNT of a 64-bit integer in a single CPU cycle.
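
The pure-software fallback has the same log-depth shape, for what it's worth: a binary search over halves of the word, so ~6 constant steps for 64 bits. A rough Rust sketch of the standard trick (not any particular compiler's builtin):

  fn leading_zeros_sw(mut x: u64) -> u32 {
      if x == 0 { return 64; }
      let mut n = 0;
      if x <= 0x0000_0000_FFFF_FFFF { n += 32; x <<= 32; }
      if x <= 0x0000_FFFF_FFFF_FFFF { n += 16; x <<= 16; }
      if x <= 0x00FF_FFFF_FFFF_FFFF { n += 8;  x <<= 8;  }
      if x <= 0x0FFF_FFFF_FFFF_FFFF { n += 4;  x <<= 4;  }
      if x <= 0x3FFF_FFFF_FFFF_FFFF { n += 2;  x <<= 2;  }
      if x <= 0x7FFF_FFFF_FFFF_FFFF { n += 1; }
      n
  }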


Note that rustc will not emit this instruction by default, to support older CPUs. Without `target-cpu=haswell` or similar, the operation will be quite slow, but still O(1) of course.


It's not that bad, all x86 processors have the BSR instruction which can be mapped to LZCNT semantics with just a few extra instructions.

Left to right - modern x86-64, baseline x86-64, baseline i686:

https://rust.godbolt.org/z/nGsM35TEo

Maybe you're thinking of POPCNT, which hurts a lot more if it doesn't compile down to the native instruction:

https://rust.godbolt.org/z/xcxG3v4Mn


> That's O(1).

A loop over 2^64 elements is also O(1), but I don't think people are OK with labeling every program that can run on a 64-bit PC as O(1).


You should really refresh yourself on Big-O notation.


Whether you call that O(1) or O(n) depends on the level of abstraction you want to talk about, and what you want to analyse.

For example, in some sense, anything you can do on a finite computer is O(1) (or alternatively, it's an infinite loop). But that is a very boring sense, and seldom useful for analysis.

There's a formal definition for big-O notation, but most of the parameters to that formal definition are only implied in casual discussions.


> Aw. Why not?

Because the global allocator API is defined as free(pointer address) and this uses free(buffer handle).

It would require a reverse lookup structure from address to buffer handle, e.g. red-black tree. Maintaining it would no longer be O(1).

> the fact that the number of bits is constant doesn't make this O(1), it just makes it O(n) where n := 64

O(n) when n is constant is equal to O(1).

But lzcnt uses a fixed number of clock cycles so it's not really O(n) either.


GlobalAlloc is also required to be thread safe, and TLSF doesn't even attempt to handle that. I suppose you could get away with it on single-threaded platforms like microcontrollers or (vanilla) WebAssembly but it wouldn't generalize unless you wrapped it in a mutex, and then the performance would be terrible.


Correct me if I'm wrong, but can't you put an Rc<RefCell<OffsetAllocator>> in your std::alloc::Allocator implementation for interior mutability?

This makes a non-thread-safe allocator, with the (desired) side effect of making anything allocating from it (e.g. Vec) a !Send.

Or is there a requirement that Allocators must be Sync and Send?


This is about GlobalAlloc, the trait that actually exists in stable Rust. Only static variables can be marked as the #[global_allocator], and statics are always required to be Send + Sync. Of course, if you're really sure that only a single thread will ever exist, you can pinky-promise it by putting a couple unsafe impls on the allocator type. But then the language won't protect you if you do try to allocate or deallocate from multiple threads.

(More practically, you can just write a GlobalAlloc that forwards to an instance of a thread-local allocator. Though that runs the risk of thread-local deinitialization running in an unexpected order.)


Any uc worth programming on has interrupts!


Even with interrupts there is almost always a single execution unit, and thus single-threaded. An interrupt is just a jump instruction to a predefined handler. The code that was running is entirely stopped until the system decides to resume from that point.


That doesn't mean that naive single-threaded code is necessarily interrupt-safe. If the allocator is interrupted and the interrupt service routine needs to make use of the allocator's data structures for any reason there could very much be a conflict.

This particular algorithm is focused on GPUs, so I'm not clear on the applicability. However I did want to clear up the idea that there are no concurrency concerns for single-threaded library code, an assumption that's been the source of many bugs.


The presence of interrupts is basically equivalent to not being single threaded, though. There are true single-threaded environments, it's just limited to userspace processes that don't handle signals.


It is even worse: reentrancy safety is a different property than thread safety. For example, you can't use mutexes to guarantee mutual exclusion with interrupts.

Some algorithms are both thread safe and reentrancy safe, but generally this is not the case.


> It would require a reverse lookup structure from address to buffer handle, e.g. red-black tree. Maintaining it would no longer be O(1).

Well, another solution is enlarging each allocation by 8 bytes and storing the buffer handle in front of the returned allocation. So `free(ptr)` would get the handle as `*((ptr as *const u64) - 1)`.
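
A minimal sketch of that, ignoring alignment requirements larger than 8 bytes (names are made up):

  // Over-allocate by one u64, stash the handle in front, hand out the shifted pointer.
  unsafe fn stash_handle(raw: *mut u8, handle: u64) -> *mut u8 {
      (raw as *mut u64).write_unaligned(handle);
      raw.add(core::mem::size_of::<u64>())
  }

  // In free(): recover the handle from just before the user pointer.
  unsafe fn handle_before(ptr: *const u8) -> u64 {
      (ptr as *const u64).sub(1).read_unaligned()
  }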


That works but...

This kind of allocator is usually used for suballocating GPU buffers, so hiding a few bytes of metadata "in-band" can mess up your alignment requirements, and not all kinds of GPU memory are even accessible by the CPU. Due to false-sharing cache problems you would probably want a full cache line (64 bytes) to store the metadata.

For a CPU-only memory allocator your idea could work quite well. It can also be implemented on top of this code without any modifications.


OTOH, I think this would be a good fit for the "Store" proposal [1] which uses handles rather than addresses to refer to allocations.

[1] https://github.com/matthieu-m/storage/blob/main/etc/rfc.md


> It would require a reverse lookup structure from address to buffer handle, e.g. red-black tree. Maintaining it would no longer be O(1).

Not necessarily. If you are able to map a block of normal memory in a known location relative to the GPU memory that you are managing with an "offset allocator", it should be possible to directly calculate the metadata location for each "offset". This is how most allocators find arenas/buckets (whatever you want to call them) for an allocation by a pointer.

Something like this:

  +-------------------------+ 0x000000 (start of managed memory)
  | Metadata                | 
  |                         |
  |                         |
  | ...                     |
  +-------------------------+ 0x001000 
  | Padding ...             |
  +-------------------------+ 0x010000
  | GPU Memory Block        |
  |                         |
  |                         |
  |                         | ~2MB block
  |                         |
  |                         |
  |                         |
  +-------------------------+ 0x210000
With this layout, to get to the metadata for an allocation, all you need to do is align down the allocation pointer and calculate the appropriate location in the metadata page.

This obviously won't work in all scenarios, but it's a simple and practical way around a map lookup.
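
A sketch of that lookup under the layout above (granule size and slot type are assumptions, and the constants only loosely match the diagram):

  const BLOCK_BASE: usize = 0x010000;   // start of the managed GPU block
  const META_BASE: usize = 0x000000;    // start of the metadata region
  const GRANULE_SHIFT: usize = 12;      // one metadata slot per 4 KiB granule

  // Address of the metadata slot (here: one u64 handle per granule) for an offset.
  fn metadata_slot(alloc_offset: usize) -> usize {
      let granule = (alloc_offset - BLOCK_BASE) >> GRANULE_SHIFT;
      META_BASE + granule * core::mem::size_of::<u64>()
  }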


A good idea but not O(1).

If you have a large allocation you have to mark every "page" in the metadata table, which makes allocation O(size in bytes / page size).

The overhead might still be practically acceptable, even if it is not constant time.


Any big O discussion usually breaks down (or at least, changes) if you start taking into account the data size of the operands.

You generally assume for big O purposes, when analyzing sorts for example, that comparisons of elements are constant time, and that swapping elements is constant time.

On an n-bit computer, when dealing with m-bit elements, those assumptions are broadly sound. Comparing two ints doesn’t depend on the ints. Reading or writing the int from memory doesn’t take more or less time either.

But the same algorithm, while it has the same big O relationship to the size of a collection, might have a different constant factor on different CPUs and for different values of m. And you might find that for some algorithms, that constant factor’s relationship to m has its own big O.

One common place this can bite you is when you try to apply big O analysis that has the ‘constant time comparison’ assumption built in to, say, sorting arbitrarily long strings, or inserting into a hash table using arbitrary string keys. Comparing strings or calculating hashes of them is not constant time.


In a practical sense, how you define big O depends on what you consider to be the inputs to the function. If the runtime doesn't change depending on the size of the inputs that you care about, it's O(1).

Like, you might have a function with 2 inputs, N and M, and the run time is like O(m2^n), but n is fixed at a low number every time you'll run it in practice, so it's really just O(m) for your purposes.


Right. O(f(n)) is literally only defined for situations where n (1) varies between different runs of the algorithm, and (2) can grow arbitrarily large. Even though in practice ‘arbitrarily large’ is always limited by memory, storage, etc.

Talking about algorithms being O(n) in the number of bits in a value is only reasonable if the number of bits in the value actually varies between runs.


On all x86, this is an instruction that typically takes 3 clock cycles (latency) with 1 clock cycle throughput. That’s as constant time as you get.

Even if that were false, it’s O(k) where k is the number of bits, which is constant. However, in that case it might be slower than the alternatives.


Integer addition, and indeed pretty much all operations, are also loops over all bits; LZCNT is in no way unique here.


Why is this hard realtime in the face of arbitrary allocations and deallocations? These dots aren't connected anywhere within two links' distance of the given URL.


How exactly is this going to be used in Bevy?


At least one mention of it here https://github.com/bevyengine/bevy/issues/12590

> Use one set of large mesh buffers per vertex attribute layout

> Problem: Index/vertex buffers have to be re-bound when the mesh changes. This adds overhead when encoding draws and when drawing. It also prevents some optimisations like being able to draw all objects for shadow mapping for a light in one draw.

> Solution(s): [...] Use an appropriate allocator like a port of Sebastian Aaltonen's offset allocator [https://github.com/sebbbi/OffsetAllocator] to manage allocation

Where the "port of Sebastian Aaltonen's offset allocator" is what got linked in this HN submission.


The goal is to start storing multiple meshes in the same vertex buffer/index buffer. (I already have a decent chunk of this code written in a branch.) This reduces binding overhead, but moreover it allows us to start using multidraw indirect where supported, which should reduce drawcall counts dramatically. I measured a 97% drawcall count reduction for the opaque pass on the Bistro test scene, for example. The benefits are even higher for shadow maps, because they use a single shader and so the drawcall count should drop to the single digits in most cases.

I've said it before, but if drawcall count is a problem in your game, you should complain to the engine vendor. On modern GPUs, there's no reason people should be manually optimizing for drawcall count in 2024. Game engines have the tools to make it a non-issue; they just need to use them.


What's an offset allocator?


https://github.com/sebbbi/OffsetAllocator

which has links to a pertinent paper at the end of that page.


That doesn't explain anything.

And the linked paper doesn't even contain the word 'offset'.


I mean you could presumably read the code.

It looks like the general idea of TLSF is an array of second-level arrays, where each second-level array holds blocks of a fixed size. In TLSF those fixed sizes increase as powers of 2, while in this so-called offset allocator the second-level sizes follow a more complex, statistically chosen distribution. Allocation requests seemingly return an offset into a second-level array.

Kinda uncommented code, to your point, makes it not obvious.
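
Roughly, TLSF's size-to-bin mapping looks like the sketch below: the first-level index is the size's power-of-two range (one bit scan), and the second level subdivides that range linearly. SLI = 5 (32 subdivisions) is a common choice; the paper handles small sizes more carefully, this is just the shape of it:

  const SLI: u32 = 5; // log2(second-level bins per first-level bin)

  // e.g. mapping(1000) == (9, 30): the bin covering sizes [992, 1008).
  fn mapping(size: u32) -> (u32, u32) {
      debug_assert!(size > 0);
      let fl = 31 - size.leading_zeros();                // floor(log2(size))
      if fl < SLI {
          return (0, size);                              // simplified small-size case
      }
      let sl = (size >> (fl - SLI)) & ((1u32 << SLI) - 1); // next SLI bits below the top bit
      (fl, sl)
  }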



