> "The library mustn’t call malloc() internally. It’s up to the caller to allocate memory for the library. What’s nice about this is that it’s completely up to the application exactly how memory is allocated. Maybe it’s using a custom allocator, or it’s not linked against the standard library."
OP's approach will indeed work for most "minimalist"/single-header libraries, but, I personally feel it pollutes the API you're exposing to your users.
Depending on the specific situation, I may sometimes choose to expose a MODULE_CreateObject() and a MODULE_CreateObjectEx(custom_allocator, custom_deallocator).
Internally, MODULE_CreateObject() calls MODULE_CreateObjectEx(), passing the module's default allocators and deallocators (ie. HeapAlloc and HeapFree). This strikes me as a more balanced approach.
One caveat here is that you must enforce consistency across usage: you don't want some API calls to use malloc() for allocation whilst others use HeapFree() for deallocation. That would be a recipe for disaster.
To ensure that, I would often set the allocators and deallocators once, when the object is first created. They may be set through the object's initialization function, and they persist as part of the object itself.
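A minimal sketch of that two-entry-point shape, assuming plain function-pointer typedefs. Only MODULE_CreateObject() and MODULE_CreateObjectEx() are named above; the typedefs, the struct fields, and MODULE_DestroyObject() are made up for illustration, and malloc/free stand in for HeapAlloc/HeapFree as the defaults.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical callback types; not from any real library. */
typedef void *(*MODULE_AllocFn)(size_t size);
typedef void (*MODULE_FreeFn)(void *ptr);

typedef struct MODULE_Object {
    MODULE_AllocFn alloc;   /* fixed at creation, reused for any later allocation */
    MODULE_FreeFn  dealloc; /* always the matching deallocator */
    int value;              /* placeholder payload */
} MODULE_Object;

/* "Advanced" entry point: the caller chooses the allocator pair. */
MODULE_Object *MODULE_CreateObjectEx(MODULE_AllocFn alloc, MODULE_FreeFn dealloc)
{
    if (alloc == NULL || dealloc == NULL)
        return NULL;
    MODULE_Object *obj = alloc(sizeof *obj);
    if (obj == NULL)
        return NULL;
    obj->alloc = alloc;
    obj->dealloc = dealloc;
    obj->value = 0;
    return obj;
}

/* Simple entry point: forwards the module's defaults (assumed: malloc/free). */
MODULE_Object *MODULE_CreateObject(void)
{
    return MODULE_CreateObjectEx(malloc, free);
}

/* Frees with the deallocator stored in the object, so the pair cannot be mixed up. */
void MODULE_DestroyObject(MODULE_Object *obj)
{
    if (obj != NULL) {
        MODULE_FreeFn dealloc = obj->dealloc;
        dealloc(obj);
    }
}
```

Because the pair lives inside the object, no later call needs an allocator argument, which is what keeps allocation and deallocation consistent.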
> I personally feel it pollutes the API you're exposing to your users.
Pollution is in the eye of the beholder. There are many circumstances where a project or subset of a project needs to work without a heap, they just don't necessarily overlap with the "application layer code in a virtual memory process" world your intuition is calibrated against.
And sometimes this stuff needs to read a JSON object or decode base64 or utf8 too, and can't because the library is too thick.
> "Pollution is in the eye of the beholder. There are many circumstances where a project or subset of a project needs to work without a heap, they just don't necessarily overlap with the "application layer code in a virtual memory process" world your intuition is calibrated against."
That's an argument in favor of offloading allocation/deallocation to the library's users, which is exactly the core of my, and OP's, proposals. We're saying the same thing here - developers should be able to determine/control how memory is allocated and deallocated.
> "And sometimes this stuff needs to read a JSON object or decode base64 or utf8 too, and can't because the library is too thick."
I'm losing you here. I honestly feel that my proposal is all about keeping the API as simple as humanly possible, without compromising the library's flexibility when it comes to the scenarios you mentioned earlier.
But what if I don't have a heap? Not even a wrappable heap.
I could be an OS bootstrapping layer, a signal handler, an ISR, a process control project operating under strict "No dynamic allocation!" rules, a thunking layer to get to legacy code modes (BIOS says hi!), ...[1]
You're imagining a world where everything is Node or Python or Java, or at the worst C on top of the well-defined standard library. And I'm telling you that the world is bigger than that.
And more specifically, that those weird layers sometimes need library code too.
[1] (Edited to add) A malware payload, a tracing layer, a compiler-generated stub, a benchmarking hook that can't handle heap latency, ...
> "You're imagining a world where everything is Node or Python or Java, or at the worst C on top of the well-defined standard library. And I'm telling you that the world is bigger than that."
Why do you keep putting words in my mouth?
> "But what if I don't have a heap? Not even a wrappable heap."
I'm forced to repeat myself over again. At no point does my proposed API force you to rely on a heap. On the contrary, it lets you rely on whatever solution works best for you, in your specific case.
In your custom kernel project, your custom allocator() can return a buffer from a memory pool you handle yourself. Your custom deallocator() will reclaim that memory back into your custom memory pool.
In a different project, say a desktop app for Windows 10, the allocator() will simply call malloc(), and the deallocator() will call free().
This way, your allocator() can do whatever. Your deallocator() can do whatever. How is this restrictive in any way shape or form?
> In your custom kernel project, your custom allocator() can return a buffer from a memory pool you handle yourself. Your custom deallocator() will reclaim that memory back into your custom memory pool.
I don't have either. I have a statically allocated buffer big enough for one frame of data, and I need to guarantee that it never gets used twice. My code does not have a custom allocator. It does not allocate.
Are you seriously suggesting this is an issue? That's entirely up to you to solve in your custom allocator.
Use whatever mechanism is available to you. Use a global condition variable, check it atomically every time you're entering your custom allocator, increment after a successful allocation. I don't know your system's constraints, nor should I...
I'm not talking about concurrency. I'm talking about needing to know exactly how many bytes are being allocated ahead of time, because I've got 192k of ram, and 112k of them are spoken for by I/O buffers.
If I pass in an allocator that returns the statically allocated buffer, then the second call to it must abort loudly.
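For what it's worth, the "second call must abort loudly" requirement could look roughly like this. It is only a sketch: FRAME_BYTES, frame_alloc() and frame_free() are hypothetical names, and it does nothing about the other complaint, namely that the caller still cannot know ahead of time how many bytes the library will ask for.

```c
#include <stddef.h>
#include <stdlib.h>

#define FRAME_BYTES 4096u  /* hypothetical size of one frame of data */

/* The one and only buffer, aligned for any object type (C11). */
static _Alignas(max_align_t) unsigned char frame_buffer[FRAME_BYTES];
static int frame_in_use = 0;

/* Hands out the static buffer once; a second request, or an
   oversized one, aborts instead of returning NULL. */
void *frame_alloc(size_t size)
{
    if (frame_in_use || size > sizeof frame_buffer)
        abort();
    frame_in_use = 1;
    return frame_buffer;
}

/* Releases the buffer; anything but the exact buffer is a bug. */
void frame_free(void *ptr)
{
    if (!frame_in_use || ptr != (void *)frame_buffer)
        abort();
    frame_in_use = 0;
}
```

This is not a heap in any meaningful sense, just a guard around a single buffer, and it is not safe against concurrent callers as written.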
> In your custom kernel project, your custom allocator() can return a buffer from a memory pool you handle yourself. Your custom deallocator() will reclaim that memory back into your custom memory pool.
You realize you're arguing that a custom probably buggy heap implementation isn't a heap right?
> In your custom kernel project, your custom allocator() can return a buffer from a memory pool you handle yourself.
We're done. "It's OK, you can just write your own heap-like API!" is just not remotely responsive to the kind of problems I'm talking about, and that you think it is is sorely tempting me to put more words in your mouth.
If you don't think these libraries are useful, that's fine. Don't use them. Don't presume to understand the application realm before you've worked in it.
I also work in the resource-constrained / embedded native space and have had to work within the kinds of constraints you're describing. I think you're severely misunderstanding what the comment you're responding to is proposing.
Then you'll have to point me to a real world example of an API that works like that, because this is balderdash (seriously? Implement malloc and free on top of a stack-based memory pool just to decode a buffer?) to my eyes.
But what's the alternative to passing custom allocators and deallocators if you want to tightly control the way a library manages memory? If you're running with such constraints, presumably you want to be in control of memory management and not just leaving the library to do its own thing.
At this level, surely it's better to leave it to the "user" of the library - who can always write a wrapper for all their used libraries to the same style API for use in the rest of the program
This is true for libraries that have a lot of variable sized types. There are however a lot of interesting problems where you don't need many of these apart from, say, one big buffer you work with. A good example of this is a parser, or this TLS library: http://bearssl.org/. It makes integrating with a library maybe a bit more tedious but it comes with so much more control. And you could always build a layer on top that does malloc for when you don't need the control. It's great for code that will be used in many different scenarios.
Choose the size representation that bests fits your use case:
// Put this in header to help user calculate allocation needs but hide size from user
size_t LIBNAME_alloc_size(param1, param2, ...);
// Put this in the header to hide the size from user code but allow inlined size calculations
extern const size_t LIBNAME_ALLOC_X;
// Put this in the header to make size known to user (for static const allocation)
#define LIBNAME_ALLOC_Y ((size_t)42)
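As a rough illustration of the first option, here is how a size-query function might pair with a caller-provided buffer. Everything beyond the name LIBNAME_alloc_size is hypothetical (the LIBNAME_buf type, LIBNAME_init, LIBNAME_push), and the single capacity parameter is just one possible choice of arguments.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical library object: a fixed-capacity list of 32-bit samples. */
typedef struct {
    size_t   capacity;
    size_t   count;
    uint32_t samples[];  /* flexible array member: storage follows the header */
} LIBNAME_buf;

/* Bytes the caller must provide for the given capacity; 0 on overflow. */
size_t LIBNAME_alloc_size(size_t capacity)
{
    if (capacity > (SIZE_MAX - sizeof(LIBNAME_buf)) / sizeof(uint32_t))
        return 0;
    return sizeof(LIBNAME_buf) + capacity * sizeof(uint32_t);
}

/* Initializes caller-provided memory; NULL if it is missing or too small.
   The memory must be suitably aligned for LIBNAME_buf. */
LIBNAME_buf *LIBNAME_init(void *mem, size_t mem_size, size_t capacity)
{
    size_t need = LIBNAME_alloc_size(capacity);
    if (mem == NULL || need == 0 || mem_size < need)
        return NULL;
    LIBNAME_buf *b = mem;
    b->capacity = capacity;
    b->count = 0;
    return b;
}

/* Appends a sample; returns 0 on success, -1 when full. Never allocates. */
int LIBNAME_push(LIBNAME_buf *b, uint32_t value)
{
    if (b->count == b->capacity)
        return -1;
    b->samples[b->count++] = value;
    return 0;
}
```

The caller decides where the bytes come from (malloc, a pool, a static array), and the library only ever touches what it was handed.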
The "no malloc within the library" rule cannot be overstated: let the user of the library decide what is best, and do not do some junk like `some_fn()` that requires an implicit `some_free()`.
Although it may work for the integer hash set, forcing the application to deal with potential resizing when adding to a generic hash table is painful. I'll note too that the growable buffer example linked at the end violates the no malloc rule.
I disagree. This is an artificial limitation that severely hampers the library's capabilities. That kind of discipline may be fine for normal linear string-crunching, but it's cumbersome to an asinine degree if your library needs to do any sort of complex ADT manipulation.
It calls to mind Fortran-style function docs which would give a formula to compute the size of a work array for the caller to provide. That really ties the implementation to the header. It was really boring to work with, though it often pushed you to understand what was going on inside.
This kind of interface was common in fortran 77. I’m guessing you’ve used BLAS/LAPACK? In more modern fortran, it’s not so usual. Also, Fortran is much more high level than C, and how memory is allocated etc is compiler dependent to a much larger degree than for C.
I’d recommend writing a Fortran 95 wrapper which creates work arrays for you if you really need to use such routines.
Most often you do not need any sort of "complex ADT"; in that case the no-malloc advice is good but harmless.
Sometimes, you can probably do without any "complex ADT"; in that case, the no-malloc advice forces you to find the clean solution without complex ADT, thus it is really great.
In the rare cases when you intrinsically need a complex ADT, then you do it. The advice is a spirit, not an unbreakable constraint. Just like not using goto.
But what if "some" is not the subject of my program and I want to offload its complexity away so that I can concentrate on the actual subject of my program? In that case wouldn't it be better to have some_alloc, some_foo, some_free rather than to need each user to make their own, probably buggy and slow allocator?
If I'm making image editor, allocating bitmaps myself is fine, if I'm writing crud app, not so much.
At least in game development, custom allocation code is often used to either provide more debugging and profiling capabilities, or otherwise add functionality to the generic C or C++ allocator. Every good middleware library provides a way to hook into memory allocation, and ideally annotates each allocation with some sort of tag or label, so it's possible to track from the outside what an allocation was used for.
We use wrappers where I work, as we track all allocations. We can provide application context if something goes wrong, we can track usage (per thread, per subsystem, over time), and probably most importantly, we know the allocation pattern of our applications better.
A library is not a framework. A library provides the basic bricks and typically the users build a layer that fit their needs. It could be, for instance, C++ classes.
Furthermore if the authors of the library think they know better with regard to allocation, they can provide an allocator as a separate addition to the library.
With respect to the bmp example:
1. Removing the bmp_get implies that the application needs to implement a shadow image if it needs to check the color of a certain pixel. Or even worse, take a direct peek in its void memory area.
2. void pointers instead of bmp_pointers makes it easier to create a mess, the compiler will not tell you that you called the bmp library with a pointer to a jpg memory area.
3. Not doing range checking in the library - but imposing that burden on the caller - is a bad practice. If the caller does the same - expects his caller to do the checking - we end up with a security risk.
Trying to minimize the library by pushing work to the application is wrong every time you expect the library to be used more than once. Despite these objections, I like libraries that are free of IO and mallocs!
When I fell down into the rabbit hole of DNS, I wrote code to just encode and decode DNS packets [1]. All the existing libraries [2] had a complex API that provided a separate function for querying a few record types (A, AAAA, MX, TXT, SRV, maybe NS and SOA), leaving the rest unimplemented. They also tend to have complex network architectures to handle retries, caching, and parallel queries which could be hard to integrate into a project that had an existing network framework.
Mine? Just two functions: dns_encode() and dns_decode(). No I/O. No malloc().
...by having your own arena allocator! I do agree that it is quite doable in this particular case, but I always remember that a custom memory allocator of OpenSSL made Heartbleed much more devastating.
The corollary of "no error handling" is "caller must be perfect", it is the opposite to defensive coding. I can see the appeal, it puts minimal constraints on the caller, in a sense making the library functions as flexible as possible, but exposing an API so easy to misuse seems reckless, even if the API is elegant in its own way.
I'm happy to say my crypto library¹ satisfies most of his criteria:
It has 50 functions. That's too many, but it could be reduced to 10 if the user sticks to the highest level facilities. There is no dynamic memory allocation, and no I/O (actually, it doesn't even depend on libc). The structures are defined in the header to allow the user to allocate them on the stack, but looking inside is unneeded and discouraged.
It's worth pointing out that his two favorite RNGs (xoroshiro128+/xorshift128+) both fail BigCrush. According to [0] and the associated github [1], for a statistically strong RNG which is still fast, AES-CTR or splitmix64/lehmer64 are probably your best bet, unless you have AVX512, in which case an SIMD-accelerated PCG is the way to go [2]. (The other methods cap out at 1 cycle per byte, while the AVX512 PCG is 1 cycle per 32-bit integer, 4x as fast as the fastest previously tested.) While I don't doubt it could be further accelerated, I've added STL compatibility, templated unrolling, and provided some extra utilities (including random access) in a package based off code from [1] (provided by Samuel Neves) which I now use in most of my projects, and which is available at [3].
Not sure I think using AVX512 makes sense in an RNG unless you're already using those instructions for a lot of stuff. Using instructions from outside the current power bin will adjust the clock rate and frequency of the core, and doing so is very slow (measured in milliseconds -- voltage regulators need to adjust).
In general unless you have a lot of AVX512 code to run (several ms worth), you're usually better off avoiding those instructions IME :(. (The same is also true for many AVX2 instructions...)
The way the mentioned PRNGs "fail" is when testing just the lower bits (search for the occurrences of "lsb" in [1] above) and this may be important in your use cases or not. The same [1] claims in "Visual Summary" that the "cycles/byte" is 1 for various PRNGs but http://xoshiro.di.unimi.it/ seems to show that the reason splitmix64 is not preferred everywhere is that xoroshiro128+ is roughly two times faster than splitmix64 .
Regarding having lower bits poor statistically, it was known since forever that that is the case for the huge class of simple PRNGs (effectively all that are faster than the alternatives, unless maybe if there's some specialized instruction in the CPU), the question is if that is critical or not for your purposes. The author of xoroshiro128+ is of course aware of that issue, and he also writes:
"For general usage, one has to consider that its lowest bits have low linear complexity and will fail linearity tests; however, low linear complexity can have hardly any impact in practice, and certainly has no impact at all if you generate floating-point numbers using the upper bits (we computed a precise estimate of the linear complexity of the lowest bits)."
In short, if you don't know how you're going to use the PRNG and you don't have problems related to speed, sure, use the safest one. Note that "safety" is still differently understood in different use cases, e.g. take care to note that most of fast PRNGs still aren't cryptographically secure:
and that sometimes even "standardized" "cryptographically secure" turn out to be something else, e.g. the subtitle on that wikipedia page:
"NSA kleptographic backdoor in the Dual_EC_DRBG PRNG"
Also other considerations come into play when you have some specific needs and you understand the consequences: then it's not only black-and-white "safe" v.s. "not safe." For some purposes (as the mentioned generation of the floating-point numbers in some use cases) speed matters enough to sacrifice some "perfectness."
> The same [1] claims in "Visual Summary" that the "cycles/byte" is 1 for various PRNGs but http://xoshiro.di.unimi.it/ seems to show that the reason splitmix64 is not preferred everywhere is that xoroshiro128+ is roughly two times faster than splitmix64 .
I have tested xoroshiro128+ vs splitmix64 in several procedural generation & simulation code bases in C and Swift. I could never confirm the numbers on http://xoshiro.di.unimi.it/. In fact, splitmix64 was slightly faster in all my tests with different optimizations enabled. I always assumed that's because its state only occupies a single register which certainly matters in practical applications (especially in C with its restricted calling conventions). I am not absolutely sure whether that was always the reason, though.
I use fast RNGs for kernel projections, FHT-accelerated JL transforms, and data generation for numerical experiments. I don’t need cryptographic security for these purposes.
And if you generate floating point numbers maybe you don’t have to worry about lsbs alone either? The “failed” tests drop away the bits that matter the most when floating point randomness is constructed.
I don’t know the details on which bits matter in the ziggurat algorithm, which is the one I use. Is this the case for all floating point random number generators?
The PRNGs you mentioned before all generate integers. The conversion from the integer PRNG to the floating point, and the conversion from one integer range to another range needed in the ziggurat algorithm both need to be done "right" to give correct results (I can imagine that even using splitmix64 the wrong implementation could be programmed by somebody not knowing what has to be done), so if you aren't sure about these steps you should surely check their quality yourself. If these are done right, I'd personally expect xoroshiro128+ could be "good enough" even when having that specific weakness (poorer quality when using only lsb bits) that you worried about. It's important, of course, not to drop the highest bits away.
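To make the "don't drop the highest bits" point concrete, here is one common way to turn a 64-bit integer into a double in [0, 1) that keeps only the top 53 bits. This is a sketch with a hypothetical function name, not the ziggurat conversion and not code from any of the linked generators.

```c
#include <stdint.h>

/* Maps a 64-bit integer to a double in [0, 1) using only its top 53 bits,
   so the statistically weaker low 11 bits never reach the result. */
double u64_to_unit_double(uint64_t x)
{
    /* 9007199254740992 is 2^53; both the shift result and the scale
       factor are exactly representable, so no rounding occurs. */
    return (double)(x >> 11) * (1.0 / 9007199254740992.0);
}
```

The opposite mistake, masking with `x & 0x1FFFFFFFFFFFFF` to keep the low 53 bits, would feed exactly the bits that fail the "lsb" tests into the result.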
And on another side, 2om3r questions the speed measurements of the author, and I think he has a point: the speed measurements should be made in the context of the real use, otherwise the compilers are able to "cheat" (optimize pieces of the code away) if the example is too simple.
If I recall correctly, Chacha20 is as fast as AES with AVX256, isn't it? For more speed, Chacha8 has yet to be broken. And I bet even Chacha2 would pass most statistical tests, although at that point it is not secure at all. Furthermore, we could ditch Chacha compatibility by skipping the de-interleaving step for more speed.
AES could likewise benefit from reduced rounds, but since its security margin is lower than Chacha, there's a chance it would perform a bit worse at the same quality level.
I can't find a benchmark for chacha20 as a PRNG (I've only found benchmarks for salsa20...). Why don't you try the code from [1] and see how it compares to your chacha20 random number generator?
PCG is covered in sources [0-2] and my comment. It's great, but without AVX512, it's ~70% as fast as AES-CTR, lehmer64, splitmix64, and the xorshi*[0-9]\+plus families of RNGs, which in [1] were all roughly the same speed. In [2], Lemire's PCG implementation is even faster than his AVX512 version of xorshift128+.
The simple PCG example on the page outputs 32-bit numbers, using 64-bits of state and 64-bit arithmetic.
What is the recommended way to generate 64-bit numbers with PCG? Just generate two 32-bit numbers and stick them together? Or does that introduce bias or bad performance?
Yes, it will introduce bias. Use the variants with larger internal state.
If you want to see why, you could play with small RNGs with 4 bits of state each, with 2-bit outputs. Then concatenate the outputs from each and check for uniformity.
For example, say the first RNG is given by the sequence (with the value of low 2 bits following the slash):
Each have 16 unique states, and each of the four possible 2-bit outputs appear exactly 4 times in the output of a generator. So each generator is uniform.
Now create a 4-bit rng by concatenating 2-bit outputs from each generator: 3|0, 1|3, 0|3, 3|1, 3|2, 3|1, 0|0, 2|2, 1|2, 2|1, 0|0, 1|0, 2|1, 1|2, 2|3, 0|3.
This is all the 16 outputs we can get from the two generators with a period of 16 each, but you can already tell that some outputs appear more than once (0|0, 0|3, 1|2, 2|1, 3|1) and thus, obviously, there are others such as 0|1 or 0|2 that never appear!
For 4 bits of output, you really need a larger period. But even that does not guarantee uniformity when you're concatenating outputs from two independent RNGs. In fact the likelihood of getting uniformity by concatenating two random RNGs is practically nil.
On the other hand, for a single linear congruential generator, it is easy to guarantee uniformity by choosing the parameters according to the well known rules.
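The experiment is easy to try. The sketch below uses its own toy generators (two 4-bit LCGs with multiplier 5 and increments 1 and 3, chosen here for illustration; they are not the sequences from the comment above) stepped in lockstep, with the low 2 bits of each as output. It shows the effect for this toy only and says nothing by itself about PCG.

```c
#include <stdint.h>

/* Toy 4-bit LCG, state' = (5 * state + inc) mod 16.
   With an odd increment this has the full period of 16. */
static uint8_t lcg4_next(uint8_t state, uint8_t inc)
{
    return (uint8_t)((5u * state + inc) & 0x0Fu);
}

/* How many of the 16 states of one generator output `value`
   in their low 2 bits (uniform means 4 for every value). */
int count_single_output(uint8_t inc, unsigned value)
{
    uint8_t s = 0;
    int hits = 0;
    for (int i = 0; i < 16; i++) {
        if ((s & 3u) == value)
            hits++;
        s = lcg4_next(s, inc);
    }
    return hits;
}

/* Steps both generators together for one full period, concatenates
   their 2-bit outputs, and counts the distinct 4-bit results. */
int count_distinct_concatenated(void)
{
    uint8_t a = 0, b = 0;
    int seen[16] = {0};
    int distinct = 0;
    for (int i = 0; i < 16; i++) {
        unsigned out = ((a & 3u) << 2) | (b & 3u);
        if (!seen[out]) {
            seen[out] = 1;
            distinct++;
        }
        a = lcg4_next(a, 1);
        b = lcg4_next(b, 3);
    }
    return distinct;
}
```

Each generator on its own produces every 2-bit value exactly 4 times, yet the concatenation only ever produces 4 of the 16 possible 4-bit values, because the low bits of a power-of-two LCG cycle with a much shorter period than the full state.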
Sticking two 32 bit numbers together (e.g. `(uint64_t(a) << 32) | uint64_t(b)`) will not introduce bias.
IIRC, the PCG C++ distribution has a 64 bit variant (it uses 128 bit integers, which are implemented in software). I don't know if the performance is better or worse than calling the 32 bit variant twice.
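For reference, "calling the 32 bit variant twice" might look like the sketch below. The step function follows the minimal PCG32 (XSH-RR, 64-bit state) as I remember it from the reference C code; the type and function names are my own, and whether the concatenated result is good enough for your purposes is the open question in this thread, not something this snippet settles.

```c
#include <stdint.h>

/* State for a minimal PCG32 generator; inc selects the stream. */
typedef struct {
    uint64_t state;
    uint64_t inc;
} pcg32_t;

/* One PCG32 step: advance the 64-bit LCG, then permute the old state
   (xorshift followed by a data-dependent rotate) down to 32 bits. */
static uint32_t pcg32_next(pcg32_t *rng)
{
    uint64_t old = rng->state;
    rng->state = old * 6364136223846793005ULL + (rng->inc | 1u);
    uint32_t xorshifted = (uint32_t)(((old >> 18u) ^ old) >> 27u);
    uint32_t rot = (uint32_t)(old >> 59u);
    return (xorshifted >> rot) | (xorshifted << ((32u - rot) & 31u));
}

/* Builds a 64-bit value from two consecutive 32-bit outputs.
   The calls are separate statements on purpose: writing both calls
   inside one expression would leave their order unspecified in C. */
uint64_t pcg32_next_u64(pcg32_t *rng)
{
    uint32_t hi = pcg32_next(rng);
    uint32_t lo = pcg32_next(rng);
    return ((uint64_t)hi << 32) | (uint64_t)lo;
}
```

Note that the result still comes from 64 bits of state, so over a full period not every sequence of 64-bit values can occur.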
It's a style of programming C was built for, and not many other languages.
It's almost the opposite of good style elsewhere: aggressively procedural code with an elegant boundary.
I'd say it would be a good lesson to many who are down the or pure functional rabbit holes: perhaps there is some other sense of modularity that goes missing when "single purpose" is applied too narrowly.
> It's a style of programming C was built for, and not many other languages.
Rust seems well adapted for it:
> (1) Small number of functions, perhaps even as little as one.
One annoyance with that kind of stuff is that you'll get many small dependencies rather than a few big ones. Having a simple and easy way to acquire & maintain these dependencies is useful, and Cargo provides that.
> (2) No dynamic memory allocations.
> (3) No input or output.
That's pretty much what a #[no_std] (libcore-only) library is[0].
> (4) Define at most one structure, and perhaps even none.
That's the bit I disagree most about, but if that's what you want the language won't stop you.
[0] unless it requires nightly and uses alloc directly but that's not too common right now I think
I just want to understand that given the mostly negative reaction that's here at HN to the npm ecosystem which is based on the same "minimalist libraries" idea, how is this different?
I'd appreciate if the response is not about JS and/or C, but about the minimalist libraries in JS, C, or any other language. Should I use a large number of small libraries? Should I wrap up some of the code which I use into a library even if that code is just a couple of functions without any data structure?
Moreover, I'd admit that I'm a great fan of Chris Wellons blog posts which are pretty technical and original, and use some of his emacs libraries on a daily basis.
>I just want to understand that given the mostly negative reaction that's here at HN to the npm ecosystem which is based on the same "minimalist libraries" idea, how is this different?
It's different in that in C you don't pull in 200 dependencies which in-turn bring in another 10+ dependencies each.
You just use 2-3 libs you need (and that they, in turn, don't require anything, or at best the POSIX standard libs), and that's it.
I think the difference is the network of dependencies. Even small libraries on npm often depend on other libraries (e.g., the is-odd library linked in the article depends on is-number).
Minimalist libraries as in the op are generally done out of a fear of libraries, as if libraries take away control from the programmer and make the resulting program lesser. Minimalist libraries in npm arise out of a reverence towards libraries, as if something being implemented in a library is superior to an equivalent homebrewed implementation.
When the OP says a library should ideally have one function, they mean that the one function should do the brunt of the work, but leave any setup and takedown to the caller. When a node lib has a single function, this is because you are expected to import another lib for other purposes (eg left-pad for left padding, right-pad for right padding, rather than a single, generic pad library).
The difference is that there is no package tool. These libraries are not dependencies that can disappear or whose API will break. You are expected to copy them into your source tree, or even use only the idea and make your own version. Also, as my sibling poster points out, they are self-contained.
I think in recent years dogmatic minimalism has shown itself to be destructive. Time is wasted refactoring code and features are removed, reducing usability, all in service of an aesthetic that has lost sight of its purpose.
Each of the guidelines listed independently represents good practice, in general, when applicable. It's worth considering the design choices when developing a library. But promoting a definition of minimalism implicitly promotes an all-or-nothing approach to development. The choice to not include memory allocation should be entirely independent of the choice not to include I/O. Pretending otherwise indicates that the choices aren't driven by pragmatism.
Maybe I'm just noticing it more, but I think there's more discussion of how to write C well than there used to be. I speculate that the attention Rust has brought to safe systems programming has caused an uptick in interest in closer-to-the-metal languages in general, and spurred C programmers to show that there are reasonable ways to write C as well. It may be an unanticipated result of Rust's popularity that the quality of C programming improves. (Or perhaps that was the plan all along?)
What I've noticed is the lead time for all sorts of things used in small embedded systems has gotten terrifyingly long. That says to me that there is a lot of embedded work going on.
Also in the last 5 years people have abandoned the JVM as an end-all, be-all platform, which puts you squarely back into native code again.
Most things here are reasonable, but I don't see a point about having only one struct. If your state is better organized in lots of hierarchical structs, within lists, within other structs, your data will be easier to move around, copy, and zero in smaller chunks, and you can write functions which processes isolated segments of data rather than a huge global state.
The interface of the library is intended to help the user of the library, not the developer of the library.
The simplest interface is the best for the user, who does not want to know anything about the implementation details.
In the ideal case, the developer and the API designer will be different persons who are not on good terms with each other. The more the developer hates the API designer, the better.
As the user of the library I probably don't care about the hierarchies. If I want X the library can provide me a function to get X by looking up its internal substructures and arrays so that I don't have to.
This basically echoes my library building philosophy. The two biggest things are:
1. User-facing complexity: Keep the user interface just big enough to get the job done. Put the "90% of people" interface first and foremost and if you need to cover the other 10%, expose a different interface that's CLEARLY marked "Advanced. You probably don't need to use this". Don't get sucked into chrome plating everything.
2. Internal complexity: Keep your structures simple. Make your functions do one thing and exactly that thing, well. Keep your side effects to a minimum. And keep your dependencies low, because you can't trust that other people have done the same in their libraries.
I don't think there are languages similar to C in terms of simplicity, closeness to the metal, and speed.
C++ is good enough for me, but it's so slow to compile, and I don't use its most advanced features.
I wish there was a language between C and C++, without the complex semantics you can find in rust and other exotic syntax.
I don't necessarily love C or C++ in terms of features, but the syntax is just what I need. Why can't language designers write a language that is closer to C, with fancy features that don't change the language so much?
Maybe you would like the BetterC mode in the D programming language... if you mean you like C for its syntax too. A lot of modern systems prog. languages seem to adopt a more modern syntax, i.e. types after the names rather than before, no semi-colons, etc.
D stays true to C in this regard and offers a lot of fancy features. And the BetterC mode sounds suited to your requirements in that the language features don't overcomplicate things.
Personally, I'd add "don't write to your own memory" (with various stack use rules based on expected library use) and relax the "no structures" rule to encourage code that can be used from multiple threads simultaneously. Make the first argument always be the internal use data.
I’d argue these APIs could be further minimized by using a struct with a void * and a size_t in it, along with bounds checking accessors.
This would eliminate most of the ugliness in the post-allocation calls, allow for the deletion of most error checking code in each library, and would harden BMP parsing “for free”.
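Something along these lines, as a sketch: the struct and accessor names are made up, and the 32-bit little-endian reader is there because BMP header fields are stored little-endian.

```c
#include <stddef.h>
#include <stdint.h>

/* A pointer plus its length, passed around together. */
typedef struct {
    void  *data;
    size_t size;
} byte_span;

/* Reads one byte; returns 0 on success, -1 if offset is out of range. */
int span_get_u8(byte_span s, size_t offset, uint8_t *out)
{
    if (s.data == NULL || offset >= s.size)
        return -1;
    *out = ((const uint8_t *)s.data)[offset];
    return 0;
}

/* Writes one byte; returns 0 on success, -1 if offset is out of range. */
int span_set_u8(byte_span s, size_t offset, uint8_t value)
{
    if (s.data == NULL || offset >= s.size)
        return -1;
    ((uint8_t *)s.data)[offset] = value;
    return 0;
}

/* Reads a little-endian 32-bit value; the check is written so that
   offset + 4 can never wrap around. */
int span_get_u32le(byte_span s, size_t offset, uint32_t *out)
{
    if (s.data == NULL || s.size < 4 || offset > s.size - 4)
        return -1;
    const uint8_t *p = (const uint8_t *)s.data + offset;
    *out = (uint32_t)p[0]
         | ((uint32_t)p[1] << 8)
         | ((uint32_t)p[2] << 16)
         | ((uint32_t)p[3] << 24);
    return 0;
}
```

With accessors like these the range check lives in one place, and a parser built on top of them cannot read past the end of a truncated file without getting an error back.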
Please do not do this! There is so much bad advice in this article. Remember the rule: Your code should be simple, but not simpler. There is absolutely no need to abandon great facilities afforded by language and libraries to make things unreadable, undebuggable and unmaintainable.
Abandoning some facilities does not "make things unreadable, undebuggable and unmaintainable" and can allow them being used much more widely.
The limitations seem similar to #[no_std] in Rust, and while that's not something to strive for at all costs if you can do without it allows e.g. embedded developers or kernel/OS developers to use the work.
> e.g. embedded developers or kernel/OS developers to use the work.
I certainly agree that's one of the strongest reasons to avoid allocating memory etc. It's pretty clear that not many commenters have done any work outside a hosted environment... but I guess that makes the point that we're in fairly specialised territory.
Sometimes those aren't so great. For example, C has errno, a thread-local variable that gets set to the error code of the last function you called. Why can't the function just return the error code? I think it's strange how all the Linux system calls do return error codes but the standard library puts them in errno anyway.
I really like writing freestanding C because I can avoid most of the legacy.
Don't you think using errno is a cleaner pattern? Conflating the place where you expect a value and the place where you check for a reason why no value could be produced can be dangerous. (E.g.: if the value is an integer, an error code is mistaken for a value.)
Linux returns negated errno constants. This produces error codes which are outside the range of valid values. I think it's more sane than errno.
Functions of my own design almost always return status codes only. Actual data is returned through pointer parameters. This allows me to quickly determine the exact set of variables that are affected by any function call.
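A small sketch of that convention, combined with the negated-errno style mentioned above. parse_u32() is a hypothetical helper; EINVAL is assumed to be available from <errno.h>, which holds on POSIX systems and most hosted C libraries.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Parses an unsigned decimal number from text[0..len).
   Returns 0 on success or a negated errno constant on failure.
   The parsed value is delivered through *out, which is left
   untouched when an error is returned. */
int parse_u32(const char *text, size_t len, uint32_t *out)
{
    if (text == NULL || out == NULL || len == 0)
        return -EINVAL;

    uint32_t value = 0;
    for (size_t i = 0; i < len; i++) {
        if (text[i] < '0' || text[i] > '9')
            return -EINVAL;
        uint32_t digit = (uint32_t)(text[i] - '0');
        /* Reject before multiplying, so value never wraps around. */
        if (value > (UINT32_MAX - digit) / 10u)
            return -ERANGE;
        value = value * 10u + digit;
    }

    *out = value;
    return 0;
}
```

The return value only ever means "status", so a valid result can never be mistaken for an error code, and no thread-local state is involved.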