What's not cheap is requesting RAM from the OS piecemeal.
In this particular case the curl maintainer ended up saving a bit of RAM as a side-effect, but it's actually much more common for programs that benefit from this class of optimization to waste RAM as a side-effect, and yet be much faster.
For example, a common optimization in high-level programming languages is to overallocate memory for strings: if the user requests "x" you allocate 16 bytes, if you need 17 bytes you allocate 32, and so on.
This wastes RAM (a 513-byte string will malloc 1024 bytes), but it ends up being much faster on average, precisely because RAM is cheap while allocations are expensive.
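A minimal sketch of that doubling strategy in C; the struct and function names here are just illustrative, not from any particular runtime:

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical growable buffer: capacity is always a power of two,
 * so appends rarely need a new allocation. */
struct buf {
    char  *data;
    size_t len;
    size_t cap;
};

/* Round the requested size up to the next power of two (minimum 16). */
static size_t round_up_pow2(size_t n)
{
    size_t cap = 16;
    while (cap < n)
        cap *= 2;
    return cap;
}

/* Append bytes, reallocating only when the rounded-up capacity changes. */
static int buf_append(struct buf *b, const char *src, size_t n)
{
    size_t need = b->len + n;
    if (need > b->cap) {
        size_t newcap = round_up_pow2(need);
        char *p = realloc(b->data, newcap);
        if (!p)
            return -1;
        b->data = p;
        b->cap  = newcap;
    }
    memcpy(b->data + b->len, src, n);
    b->len += n;
    return 0;
}
```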
That's on Linux, though. Basic core libs like libcurl can find their way into MMU-less embedded systems with relatively crude malloc()s, where sane allocation behavior is needed to ward off fragmentation that hurts the entire system.
Wouldn't it be better to write a shim library that provides Linux-like MMU behavior at a userspace per-process level, and then target libs like libcurl to it? It'd be a sort of "memory policy driver" that you'd port to an embedded OS.
You really don't want library code providing or dictating malloc behavior, that's how things like OpenSSL and Heartbleed end up happening. It may be slower, but dealing with the politics of a particular malloc can be uglier than just making your code work well enough most of the time.
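For what it's worth, libcurl itself goes the other way: instead of dictating allocation behavior it lets the application plug in its own allocator via curl_global_init_mem(). Roughly like this, from memory (check the curl docs for the exact prototypes; the `my_*` wrappers are placeholders for whatever pool allocator an embedded platform provides):

```c
#include <stdlib.h>
#include <string.h>
#include <curl/curl.h>

/* Placeholder wrappers: in a real embedded system these would forward to
 * the platform's own pool/arena allocator instead of the system one. */
static void *my_malloc(size_t size)            { return malloc(size); }
static void  my_free(void *ptr)                { free(ptr); }
static void *my_realloc(void *ptr, size_t sz)  { return realloc(ptr, sz); }
static char *my_strdup(const char *str)        { return strdup(str); }
static void *my_calloc(size_t n, size_t size)  { return calloc(n, size); }

int main(void)
{
    /* Must be called before any other libcurl function. */
    curl_global_init_mem(CURL_GLOBAL_DEFAULT,
                         my_malloc, my_free, my_realloc,
                         my_strdup, my_calloc);

    /* ... use libcurl as usual ... */

    curl_global_cleanup();
    return 0;
}
```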
Maybe more than you think: if you're running a multithreaded app, malloc/free/new/delete need mutexes to protect the heap data structures, and that can also lead you into the kernel.
So what? To return from a freelist you alter the head pointer to point to the next item and return it.
That's what, 2 or 3 instructions? Even if you push that up to a few hundred, it's still effectively free compared to anything else curl is going to do (e.g. writing to the disk).
You also have to handle the case in which you have no remaining items, breaking up a chunk log(chunkToBreak/chunkSizeNeeded) times. And in the case that you have no viable chunk to use at all, you need to call back to the OS.
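To make the fast path vs. slow path concrete, here's a toy fixed-size free list. It's purely illustrative (single size class, not thread-safe), not how any real malloc is organized:

```c
#include <stddef.h>
#include <stdlib.h>

/* Toy fixed-size free list: each free block's first bytes store the
 * pointer to the next free block. */
#define BLOCK_SIZE        64
#define BLOCKS_PER_REFILL 1024

struct free_block {
    struct free_block *next;
};

static struct free_block *free_list;

static void *toy_alloc(void)
{
    /* Fast path: pop the head of the list -- a couple of instructions. */
    if (free_list) {
        struct free_block *b = free_list;
        free_list = b->next;
        return b;
    }

    /* Slow path: nothing left, so grab a big chunk (malloc stands in for
     * "ask the OS" here) and carve it into blocks. */
    char *chunk = malloc((size_t)BLOCK_SIZE * BLOCKS_PER_REFILL);
    if (!chunk)
        return NULL;
    for (int i = 1; i < BLOCKS_PER_REFILL; i++) {
        struct free_block *b = (struct free_block *)(chunk + i * BLOCK_SIZE);
        b->next = free_list;
        free_list = b;
    }
    return chunk;   /* hand out the first block directly */
}

static void toy_free(void *p)
{
    struct free_block *b = p;
    b->next = free_list;
    free_list = b;
}
```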
Well yes, they removed the network overhead and disk overhead by running it on localhost and writing to /dev/null; once you remove two bottlenecks you'll find that something else becomes the new bottleneck.
libc is not the OS; it's the standard library for C, in the same way the Java class library isn't the OS but the standard library for Java.
You seem insistent on having some side-thread argument with me for some reason.
> once you remove two bottlenecks...
The article we're both commenting on goes to great lengths to point this out. I don't know why you're trying to convince me or anyone else of this.
Yes, in a network library malloc overhead isn't usually a worthwhile thing to worry about, but libcurl is hardly just any random library.
> libc is not the OS, it's the standard library for C.
I'm not trying to convince you that you should consider libc part of the OS.
I was merely clarifying that, in the context of my comment (which you replied to assuming I just meant the kernel), I was talking about the C library as well.
FWIW lots of people use "Operating System" to mean the set of software you get by default. OpenBSD, GNU, OSX, Android etc. definitely consider libc to be part of the OS, and Android probably considers its java class library part of the base OS. If everyone meant "the kernel" by "the OS" we'd just say "the kernel".
You don't have to agree with that, but correcting people on that point is about as futile as trying to convince people that your use of "hacker" is correct.
I would only add that the expensive part isn't the allocation per se, but copying the data from one chunk to another. If you allocate 1024 when you only need 700, then when you end up needing 701 you don't have to copy all 700 bytes into a new chunk. This can add up pretty quickly if you're copying the data every time you append one new element: if I appended 20 bytes to my array of 700 one at a time and copied each time, I'd end up copying around 14K worth of bytes when everything is said and done.
But that said, I would also keep in mind that `malloc` will generally be pretty fast - it's not going to hit up the OS every time you make a request. The advantage here isn't because `malloc` is slow, it's because copying is slow.
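To make the worst case concrete, here's an exact-fit append, where every single append has to copy everything that came before (the function name is just for illustration):

```c
#include <stdlib.h>
#include <string.h>

/* Naive append: allocate an exact-fit buffer every time, so each append
 * copies all of the existing bytes. Appending 20 bytes one at a time to a
 * 700-byte buffer copies roughly 700 + 701 + ... + 719 ~= 14K in total. */
static char *append_exact(char *old, size_t oldlen, char byte)
{
    char *grown = malloc(oldlen + 1);
    if (!grown)
        return NULL;
    memcpy(grown, old, oldlen);     /* the expensive part */
    grown[oldlen] = byte;
    free(old);
    return grown;
}
```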
Assuming a 2 GHz CPU and the reasonable assumption that the data is in L1 cache, reading 14 kB from L1 takes: 14 kB / (2 × 32 B per cycle) / 2 GHz ≈ 110 nanoseconds.
It'll take a bit longer (maybe 3-5x as long?) to write it to newly allocated memory (likely cache misses), but the point is that copying memory is just not as expensive as is often assumed.
I'm pretty sure malloc is the bottleneck in this case; pointer chasing is slow.
I don't quite understand your argument - you give lots of numbers for how long it takes to copy, but ignore cache misses and just come to the conclusion that `malloc` is the bottleneck because "pointer chasing is slow". If you're going to give theoretical numbers for copying, why not give numbers for the "pointer chasing" as well? The numbers you gave don't really mean much if there is nothing to compare them to. Yes, 110 nanoseconds is fast, but copying a 700-byte array 20 times was never intended to be an example that would grind your computer to a halt.
You also don't give enough weight to cache misses in this scenario. An L1 cache miss (so the memory is in L2) is going to be approximately a ~15x slowdown from L1. A full main memory hit will be about ~200x slower, not nearly the 3-5x you're guessing. Moreover, the cache misses in `malloc` will likely be on the same memory you're allocating anyway, meaning those cache misses were likely going to happen regardless.
With that said, the `malloc` time should be roughly constant regardless of the size of the allocation (unless your allocation requires requesting memory from the OS), but the copying won't be. Once you start getting into structures that are fairly large, copying them into newly `malloc`ed memory anytime you expand them ends up being a fairly expensive operation that I would easily wager dwarfs any time spent in `malloc` itself.
But I mean, it's easy to argue basically any of these points without actual testing, so I've tried running a few basic tests on my own. Obviously your results would vary (I'm currently running on a fairly slow CPU):
First, I ran tests on different block sizes and numbers of allocations without freeing any of the allocated memory. In these tests, the copying tended to overshadow the allocations without copying when the block size was above around 8k. But because I wasn't calling `free`, allocating would quickly eat up most of the time if the number of allocations was increased (with most of the time being spent in the kernel). I did most of my tests with around 40000 allocations, and above that things would get quite slow with the allocations alone, even with block sizes larger than 8k.
With calling `free`, however, things tend to lean much more heavily in favor of the copying being more expensive. Right around a block size of 1k is where the allocation costs about the same as the copy. The speed-up is of course due to `malloc` always having enough free memory available and not having to ask the OS for more (so this test represents the basic cost of the "pointer chasing" alone). Also, as I had assumed, the time for `malloc` without copying is 100% dependent on the number of allocations - changing the size of the block doesn't change the time without copying.
Now that said, these tests are by no means definitive, but at the very least they do show that `malloc` is not necessarily expensive compared to the copying, especially once you go above a few k in your `malloc` size and/or when the total number of allocations starts to ramp up very high. If your allocation size is fairly small, though, then I would agree the `malloc` time is at least equivalent to (or more than) the time spent copying.
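For reference, this is the rough shape of the micro-benchmark I mean; it isn't the exact code I ran, and the block size and allocation count are just knobs to turn:

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

#define N_ALLOCS 40000
#define BLOCK    1024      /* vary this to find the crossover point */

static double now_sec(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec / 1e9;
}

int main(void)
{
    volatile char sink;    /* keeps the compiler from eliding the loops */
    char *src = malloc(BLOCK);
    memset(src, 'x', BLOCK);

    /* malloc + free only */
    double t0 = now_sec();
    for (int i = 0; i < N_ALLOCS; i++) {
        char *p = malloc(BLOCK);
        p[0] = (char)i;
        sink = p[0];
        free(p);
    }
    double t1 = now_sec();

    /* malloc + copy + free */
    for (int i = 0; i < N_ALLOCS; i++) {
        char *p = malloc(BLOCK);
        memcpy(p, src, BLOCK);
        sink = p[0];
        free(p);
    }
    double t2 = now_sec();

    printf("alloc only:   %.3f ms\n", (t1 - t0) * 1e3);
    printf("alloc + copy: %.3f ms\n", (t2 - t1) * 1e3);
    free(src);
    return 0;
}
```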
I oversimplified, but I think my numbers are about in the ballpark when it comes to copying. It's fast, because CPUs are optimized for linear memory access patterns. I also took into account the high likelihood that the memory was touched before copying and will be touched again immediately after copying. One should consider cache misses at the system level, not in isolation.
"malloc" will need for the very least to keep lists of free areas and try to avoid virtual memory fragmentation. Fragmented heaps (="real life code" situation) are significantly slower than fresh clean heaps, because the are more allocation list entries to check.
Because the amount of work malloc needs to do is variable, it's hard to give any numbers for how long that pointer-chasing operation will take.
Pointer chasing is slow, because non-sequential access patterns, such as following pointers, create a data dependency and cannot be prefetched. The CPU is likely to stall waiting for the dependency to be resolved. Typically each pointer dereferencing ("chasing") operation will also touch a different cache line, so cache misses are also common.
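A quick illustration of the two access patterns (nothing malloc-specific here, just why dependent loads stall):

```c
#include <stddef.h>

struct node {
    struct node *next;
    long value;
};

/* Each load of n->next depends on the previous one, so the CPU cannot
 * start the next fetch until the current cache miss resolves. */
long sum_list(const struct node *n)
{
    long sum = 0;
    for (; n; n = n->next)
        sum += n->value;
    return sum;
}

/* Sequential access: addresses are known in advance, so the hardware
 * prefetcher can stream the data in ahead of the loop. */
long sum_array(const long *a, size_t len)
{
    long sum = 0;
    for (size_t i = 0; i < len; i++)
        sum += a[i];
    return sum;
}
```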
Synthetic tests are way more likely to reside in L1 cache (including allocation lists) than anything in real life.
Many malloc implementations use mmap (or an equivalent) for allocations that exceed a certain threshold. In that case, copying is slow; you'll pay a lot extra for page faults during the copy. Page faults might also occur on a fresh heap in a synthetic test.
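On glibc specifically, that threshold is even tunable. A sketch (glibc-only, and note the default threshold is adjusted dynamically at runtime, so treat the numbers as illustrative):

```c
#include <malloc.h>
#include <stdlib.h>

int main(void)
{
    /* glibc-specific knob: raise the size above which malloc switches to
     * mmap, so large buffers are recycled on the heap instead of coming
     * back as fresh, unfaulted pages every time. */
    mallopt(M_MMAP_THRESHOLD, 1 << 20);   /* 1 MiB */

    char *big = malloc(512 * 1024);       /* now served from the heap */
    free(big);
    return 0;
}
```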
I know from personal experience lots of small size mallocs sometimes cause non-trivial performance issues.
Standard disclaimer concerning any optimization: always profile your code. Things work differently in real code than in benchmarks, and what was true last time in your real-life scenario is in no way guaranteed to be true next time.