What's not cheap is requesting RAM from the OS piecemeal.
In this particular case the curl maintainer ended up saving a bit of RAM as a side-effect, but it's actually much more common for programs that benefit from this class of optimization to waste RAM as a side-effect, and yet be much faster.
For example, a common optimization in high-level programming languages is to overallocate memory for strings: if the user requests "x" you allocate 16 bytes, if you need 17 bytes you allocate 32, and so on.
This wastes RAM (a 513-byte string will malloc 1024 bytes), but it ends up being much faster on average, precisely because RAM is cheap while allocations are expensive.
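A minimal sketch of that doubling strategy in C; the struct and function names here are just illustrative, not from any particular runtime:

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical growable buffer: capacity is always a power of two,
 * so appends rarely need a new allocation. */
struct buf {
    char  *data;
    size_t len;
    size_t cap;
};

/* Round the requested size up to the next power of two (minimum 16). */
static size_t round_up_pow2(size_t n)
{
    size_t cap = 16;
    while (cap < n)
        cap *= 2;
    return cap;
}

/* Append bytes, reallocating only when the rounded-up capacity changes. */
static int buf_append(struct buf *b, const char *src, size_t n)
{
    size_t need = b->len + n;
    if (need > b->cap) {
        size_t newcap = round_up_pow2(need);
        char *p = realloc(b->data, newcap);
        if (!p)
            return -1;
        b->data = p;
        b->cap  = newcap;
    }
    memcpy(b->data + b->len, src, n);
    b->len += n;
    return 0;
}
```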
That's on Linux, though. Basic core libs like libcurl can find their way into MMU-less embedded systems with relatively crude malloc()s, where sane allocation behavior is needed to ward off fragmentation that hurts the entire system.
Wouldn't it be better to write a shim library that provides Linux-like MMU behavior at a userspace per-process level, and then target libs like libcurl to it? It'd be a sort of "memory policy driver" that you'd port to an embedded OS.
You really don't want library code providing or dictating malloc behavior, that's how things like OpenSSL and Heartbleed end up happening. It may be slower, but dealing with the politics of a particular malloc can be uglier than just making your code work well enough most of the time.
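For what it's worth, libcurl itself goes the other way: instead of dictating allocation behavior it lets the application plug in its own allocator via curl_global_init_mem(). Roughly like this, from memory (check the curl docs for the exact prototypes; the `my_*` wrappers are placeholders for whatever pool allocator an embedded platform provides):

```c
#include <stdlib.h>
#include <string.h>
#include <curl/curl.h>

/* Placeholder wrappers: in a real embedded system these would forward to
 * the platform's own pool/arena allocator instead of the system one. */
static void *my_malloc(size_t size)            { return malloc(size); }
static void  my_free(void *ptr)                { free(ptr); }
static void *my_realloc(void *ptr, size_t sz)  { return realloc(ptr, sz); }
static char *my_strdup(const char *str)        { return strdup(str); }
static void *my_calloc(size_t n, size_t size)  { return calloc(n, size); }

int main(void)
{
    /* Must be called before any other libcurl function. */
    curl_global_init_mem(CURL_GLOBAL_DEFAULT,
                         my_malloc, my_free, my_realloc,
                         my_strdup, my_calloc);

    /* ... use libcurl as usual ... */

    curl_global_cleanup();
    return 0;
}
```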
Maybe more than you think: if you're running a multithreaded app, malloc/free/new/delete need mutexes to protect the heap data structures, and that can also lead you into the kernel.
So what? To return from a freelist you alter the head pointer to point to the next item and return it.
That's what, 2 or 3 instructions? Even if you push that up to a few hundred, it's still effectively free compared to anything else curl is going to do (e.g. writing to the disk).
You also have to handle the case in which you have no remaining items, breaking up a chunk log(chunkToBreak/chunkSizeNeeded) times. And in the case that you have no viable chunk to use at all, you need to call back to the OS.
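To make the fast path vs. slow path concrete, here's a toy fixed-size free list. It's purely illustrative (single size class, not thread-safe), not how any real malloc is organized:

```c
#include <stddef.h>
#include <stdlib.h>

/* Toy fixed-size free list: each free block's first bytes store the
 * pointer to the next free block. */
#define BLOCK_SIZE        64
#define BLOCKS_PER_REFILL 1024

struct free_block {
    struct free_block *next;
};

static struct free_block *free_list;

static void *toy_alloc(void)
{
    /* Fast path: pop the head of the list -- a couple of instructions. */
    if (free_list) {
        struct free_block *b = free_list;
        free_list = b->next;
        return b;
    }

    /* Slow path: nothing left, so grab a big chunk (malloc stands in for
     * "ask the OS" here) and carve it into blocks. */
    char *chunk = malloc((size_t)BLOCK_SIZE * BLOCKS_PER_REFILL);
    if (!chunk)
        return NULL;
    for (int i = 1; i < BLOCKS_PER_REFILL; i++) {
        struct free_block *b = (struct free_block *)(chunk + i * BLOCK_SIZE);
        b->next = free_list;
        free_list = b;
    }
    return chunk;   /* hand out the first block directly */
}

static void toy_free(void *p)
{
    struct free_block *b = p;
    b->next = free_list;
    free_list = b;
}
```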
Well yes, they removed the network overhead and disk overhead by running it on localhost and writing to /dev/null; once you remove two bottlenecks you'll find that something else becomes the new bottleneck.
libc is not the OS; it's the standard library for C, in the same way the Java class library isn't the OS but the standard library for Java.
You seem insistent on having some side-thread argument with me for some reason.
> once you remove two bottlenecks...
The article we're both commenting on goes to great lengths to point this out. I don't know why you're trying to convince me or anyone else of this.
Yes, in a network library malloc overhead isn't usually a worthwhile thing to worry about, but libcurl is hardly just any random library.
> libc is not the OS, it's the standard library for C.
I'm not trying to convince you that you should consider libc part of the OS.
I was merely clarifying that, in the context of my comment (which you replied to assuming I just meant the kernel), I was talking about the C library as well.
FWIW lots of people use "Operating System" to mean the set of software you get by default. OpenBSD, GNU, OSX, Android etc. definitely consider libc to be part of the OS, and Android probably considers its java class library part of the base OS. If everyone meant "the kernel" by "the OS" we'd just say "the kernel".
You don't have to agree with that, but correcting people on that point is about as futile as trying to convince people that your use of "hacker" is correct.
I would only add that the expensive part isn't the allocation per se, but copying the data from one chunk to another. If you allocate 1024 when you only need 700, then when you end up needing 701 you don't have to copy all 700 bytes into a new chunk. This can add up pretty quickly if you're copying the data every time you append one new element: if I appended 20 bytes to my array of 700 one at a time and copied each time, I'd end up copying around 14K worth of bytes when everything is said and done.
But that said, I would also keep in mind that `malloc` will generally be pretty fast - it's not going to hit up the OS every time you make a request. The advantage here isn't because `malloc` is slow, it's because copying is slow.
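To make the worst case concrete, here's an exact-fit append, where every single append has to copy everything that came before (the function name is just for illustration):

```c
#include <stdlib.h>
#include <string.h>

/* Naive append: allocate an exact-fit buffer every time, so each append
 * copies all of the existing bytes. Appending 20 bytes one at a time to a
 * 700-byte buffer copies roughly 700 + 701 + ... + 719 ~= 14K in total. */
static char *append_exact(char *old, size_t oldlen, char byte)
{
    char *grown = malloc(oldlen + 1);
    if (!grown)
        return NULL;
    memcpy(grown, old, oldlen);     /* the expensive part */
    grown[oldlen] = byte;
    free(old);
    return grown;
}
```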
Assuming a 2 GHz CPU and the reasonable assumption that the data is in L1 cache, reading 14 kB from L1 takes: 14 kB / (2 × 32 B per cycle) / 2 GHz ≈ 110 nanoseconds.
It'll take a bit longer (maybe 3-5x as long?) to write it to newly allocated memory (likely cache misses), but the point is that copying memory is just not as expensive as is often assumed.
I'm pretty sure malloc is the bottleneck in this case; pointer chasing is slow.
I don't quite understand your argument - you give lots of numbers for how long it takes to copy, but ignore cache misses and just come to the conclusion that `malloc` is the bottleneck because "pointer chasing is slow". If you're going to give theoretical numbers for copying, why not give numbers for the "pointer chasing" as well? The numbers you gave don't really mean much if there is nothing to compare them to. Yes, 110 nanoseconds is fast, but copying a 700-byte array 20 times was never intended to be an example that would grind your computer to a halt.
You also don't give enough weight to cache misses in this scenario. An L1 cache miss (so the memory is in L2) is going to be approximately a ~15x slowdown from L1. A full main memory hit will be about ~200x slower, not nearly the 3-5x you're guessing. Moreover, the cache misses in `malloc` will likely be on the same memory you're allocating anyway, meaning those cache misses were likely going to happen regardless.
With that said, the `malloc` time should be roughly constant regardless of the size of the allocation (unless your allocation requires requesting memory from the OS), but the copying won't be. Once you start getting into structures that are fairly large, copying them into newly `malloc`ed memory anytime you expand them ends up being a fairly expensive operation that I would easily wager dwarfs any time spent in `malloc` itself.
But I mean, it's easy to argue basically any of these points without actual testing, so I've tried running a few basic tests on my own. Obviously your results would vary (I'm currently running on a fairly slow CPU):
First, I ran tests on different block sizes and numbers of allocations without freeing any of the allocated memory. In these tests, the copying tended to overshadow the allocations without copying when the block size was above around 8k. But because I wasn't calling `free`, allocating would quickly eat up most of the time if the number of allocations was increased (with most of the time being spent in the kernel). I did most of my tests with around 40000 allocations, and above that things would get quite slow with the allocations alone, even with block sizes larger than 8k.
With calling `free`, however, things tend to lean much more heavily in favor of the copying being more expensive. Right around a block size of 1k is where the allocation costs about the same as the copy. The speed-up is of course due to `malloc` always having enough free memory available and not having to ask the OS for more (so this test represents the basic cost of the "pointer chasing" alone). Also, as I had assumed, the time for `malloc` without copying is 100% dependent on the number of allocations - changing the size of the block doesn't change the time without copying.
Now that said, these tests are by no means definitive, but at the very least they do show that `malloc` is not necessarily expensive compared to the copying, especially once you go above a few k in your `malloc` size and/or when the total number of allocations starts to ramp up very high. If your allocation size is fairly small, though, then I would agree the `malloc` time is at least equivalent to (or more than) the time spent copying.
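For reference, this is the rough shape of the micro-benchmark I mean; it isn't the exact code I ran, and the block size and allocation count are just knobs to turn:

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

#define N_ALLOCS 40000
#define BLOCK    1024      /* vary this to find the crossover point */

static double now_sec(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec / 1e9;
}

int main(void)
{
    volatile char sink;    /* keeps the compiler from eliding the loops */
    char *src = malloc(BLOCK);
    memset(src, 'x', BLOCK);

    /* malloc + free only */
    double t0 = now_sec();
    for (int i = 0; i < N_ALLOCS; i++) {
        char *p = malloc(BLOCK);
        p[0] = (char)i;
        sink = p[0];
        free(p);
    }
    double t1 = now_sec();

    /* malloc + copy + free */
    for (int i = 0; i < N_ALLOCS; i++) {
        char *p = malloc(BLOCK);
        memcpy(p, src, BLOCK);
        sink = p[0];
        free(p);
    }
    double t2 = now_sec();

    printf("alloc only:   %.3f ms\n", (t1 - t0) * 1e3);
    printf("alloc + copy: %.3f ms\n", (t2 - t1) * 1e3);
    free(src);
    return 0;
}
```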
I oversimplified, but I think my numbers are about in the ballpark when it comes to copying. It's fast, because CPUs are optimized for linear memory access patterns. I also took into account the high likelihood that the memory was touched before copying and will be touched again immediately after copying. One should consider cache misses at the system level, not in isolation.
"malloc" will need for the very least to keep lists of free areas and try to avoid virtual memory fragmentation. Fragmented heaps (="real life code" situation) are significantly slower than fresh clean heaps, because the are more allocation list entries to check.
Because the amount of work malloc needs to do is variable, it's hard to give any numbers for how long that pointer-chasing operation will take.
Pointer chasing is slow, because non-sequential access patterns, such as following pointers, create a data dependency and cannot be prefetched. The CPU is likely to stall waiting for the dependency to be resolved. Typically each pointer dereferencing ("chasing") operation will also touch a different cache line, so cache misses are also common.
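A quick illustration of the two access patterns (nothing malloc-specific here, just why dependent loads stall):

```c
#include <stddef.h>

struct node {
    struct node *next;
    long value;
};

/* Each load of n->next depends on the previous one, so the CPU cannot
 * start the next fetch until the current cache miss resolves. */
long sum_list(const struct node *n)
{
    long sum = 0;
    for (; n; n = n->next)
        sum += n->value;
    return sum;
}

/* Sequential access: addresses are known in advance, so the hardware
 * prefetcher can stream the data in ahead of the loop. */
long sum_array(const long *a, size_t len)
{
    long sum = 0;
    for (size_t i = 0; i < len; i++)
        sum += a[i];
    return sum;
}
```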
Synthetic tests are way more likely to reside in L1 cache (including allocation lists) than anything in real life.
Many malloc implementations use mmap (or an equivalent) for allocations that exceed a certain threshold. In that case, copying is slow; you'll pay a lot extra for page faults during the copy. Page faults might also occur on a fresh heap in a synthetic test.
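On glibc specifically, that threshold is even tunable. A sketch (glibc-only, and note the default threshold is adjusted dynamically at runtime, so treat the numbers as illustrative):

```c
#include <malloc.h>
#include <stdlib.h>

int main(void)
{
    /* glibc-specific knob: raise the size above which malloc switches to
     * mmap, so large buffers are recycled on the heap instead of coming
     * back as fresh, unfaulted pages every time. */
    mallopt(M_MMAP_THRESHOLD, 1 << 20);   /* 1 MiB */

    char *big = malloc(512 * 1024);       /* now served from the heap */
    free(big);
    return 0;
}
```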
I know from personal experience lots of small size mallocs sometimes cause non-trivial performance issues.
Standard disclaimer concerning any optimization: always profile your code. Things work differently in real code than in benchmarks, and what was true last time in your real-life scenario is in no way guaranteed to be true next time.