Some of this seems focused on working around broken primitives and frameworks.
Concurrency primitives in Java will not go to the kernel unless a thread actually has to park, and if you use an executor service or something similar, threads will not go to sleep while the task queue is non-empty. I suspect pthread primitives are also careful to avoid coordination when none is needed.
I don't actually see how a task queue would unschedule a consumer for each element if the queue is not empty. What is going to wake up the thread that is blocking on a non-empty queue? I guess I am spoiled by having queues solved well in a standard library.
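For what it's worth, a minimal pthreads sketch (not any particular library's code) of the usual pattern: the consumer only parks when the queue is genuinely empty, and in common implementations a signal with no waiter doesn't need a syscall at all.

    #include <assert.h>
    #include <pthread.h>
    #include <stddef.h>

    #define CAP 1024

    struct task_queue {
        void *items[CAP];
        size_t head, tail, count;
        pthread_mutex_t lock;       /* init with PTHREAD_MUTEX_INITIALIZER */
        pthread_cond_t  not_empty;  /* init with PTHREAD_COND_INITIALIZER  */
    };

    void queue_push(struct task_queue *q, void *item) {
        pthread_mutex_lock(&q->lock);
        assert(q->count < CAP);              /* full-queue handling omitted */
        q->items[q->tail] = item;
        q->tail = (q->tail + 1) % CAP;
        q->count++;
        /* cheap when no consumer is actually parked */
        pthread_cond_signal(&q->not_empty);
        pthread_mutex_unlock(&q->lock);
    }

    void *queue_pop(struct task_queue *q) {
        pthread_mutex_lock(&q->lock);
        while (q->count == 0)                /* park only when truly empty */
            pthread_cond_wait(&q->not_empty, &q->lock);
        void *item = q->items[q->head];
        q->head = (q->head + 1) % CAP;
        q->count--;
        pthread_mutex_unlock(&q->lock);
        return item;
    }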
Context switching is also something that comes up a lot as the sort of thing that is expensive. Context switching is only expensive if the task size is small relative to the cost of a context switch, which isn't that large. My experience is that for small non-blocking tasks you can run a thread per core to expose the available parallelism, and everything else will tolerate context switching.
I am also perpetually hearing about the importance of hot caches and not switching threads. Caches are just tracking state for tasks, and unless you have actually done something to create locality between tasks there is nothing to make them stay hotter anyways.
If the state the cache is tracking is multiple thread stacks, well... the CPU doesn't know the difference between data on a stack and data that it is chasing through some pointer.
The real problem is having a task migrate to a different CPU instead of waiting its turn in the right spot and that can be solved other ways.
Access pattern matters as well. If you are going to sequentially process buffers then prefetching will work and there is no benefit to a hot cache. That is where the emphasis on zero copy tends to show holes. Think about the difference in speed between RAM and a network or disk interface, and how much processing you are going to do beyond just the copying.
My main beef is that pushing this kind of performance thinking without providing measurements that show where it does and doesn't matter encourages the kind of premature optimization that isn't productive.
"Caches are just tracking state for tasks, and unless you have actually done something to create locality between tasks there is nothing to make them stay hotter anyways."
This is only true if your data is small enough that it never gets evicted. Otherwise, there are certainly things that can make them stay hotter without involving multiple cores: You can shrink your data, reorder your data, or reorder your traversal of your data.
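To illustrate the traversal-reordering point with a toy example (same data, same core, no thread switching anywhere in sight), only the loop order below changes how much of each fetched cache line actually gets used:

    #define N 1024
    static double grid[N][N];

    double sum_row_major(void) {
        double s = 0.0;
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                s += grid[i][j];    /* walks memory sequentially */
        return s;
    }

    double sum_col_major(void) {
        double s = 0.0;
        for (int j = 0; j < N; j++)
            for (int i = 0; i < N; i++)
                s += grid[i][j];    /* strides N*8 bytes per step, uses one
                                       element per cache line it pulls in */
        return s;
    }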
One problem with "read everything on the socket until it's empty" is an issue of fairness. A particularly greedy client could keep its connection filled with requests, and this approach would monopolize a server thread, potentially starving other clients/sockets.
Btw, that's not just theoretical. The connection manager in OpenLDAP slapd used to do exactly that - and then we found exactly that problem while benchmarking/soak testing. Now we always process only one request from a connection before moving on to the next connection. Not as efficient, but more fair.
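Roughly this shape; the names below are made-up stand-ins, not slapd's actual code:

    typedef struct conn conn_t;
    typedef struct request request_t;
    request_t *next_request(conn_t *c);   /* parse at most one buffered PDU */
    void handle(conn_t *c, request_t *r);

    /* One pass over the ready connections: at most one request each,
       so a flooded connection cannot monopolize the thread. */
    void serve_pass(conn_t **ready, int nready) {
        for (int i = 0; i < nready; i++) {
            request_t *req = next_request(ready[i]);
            if (req != NULL)
                handle(ready[i], req);
            /* anything still buffered waits for the next pass */
        }
    }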
There are always tradeoffs. Time sharing, multi-processing, and time-slicing are all inherently inefficient. Batch mode processing, where 100% of system resources are dedicated to a single job until it completes, is most efficient and gets jobs done quicker. We accept this efficiency tradeoff because we believe that interactive processing is more user-friendly.
So take the quest for efficiency with a grain of salt. Perfect efficiency may actually give a worse user experience.
On the flip side, we have LMDB, a read-optimized database engine which does no system calls or blocking library calls at all in the read codepath. It is about as close to perfectly efficient as one can get without resorting to hand-optimized machine language. Fairness inside LMDB isn't a concern, because calling apps are always going to trigger context switches on their own anyway.
The quest for efficiency has different constraints depending on what level of a system you're programming in.
One request per connection at a time, while fair, seems a bit harsh. From a performance perspective it's not a bad idea to process up to n requests in at most m time from a connection before moving on to the next, and thus amortize the cost of switching connections.
The issue of magic numbers like this n is discussed in the article. n=1 has the benefit of perfect fairness. All other values have to be tweaked for the platform.
The article also explicitly says you should amortize the cost of things like context switches, and switching connections is one such cost.
I never stated that n and m have to be magic numbers. They can (and should) be adaptive. n=1 does provide perfect fairness, but what if switching connections costs just as much as processing one request? Setting n = 2 then increases throughput by 33% (two requests per three time units instead of one per two), at the cost of a 50% longer wait until the last connection is handled. However, because throughput is increased, any subsequent requests will be handled faster than with n = 1.
In reality you want a sane m value: the time allowed to cycle through all connections. I'm not sure exactly how to make this adaptive, but it's likely very dependent on the nature of the usage (HTTP connections from a GUI, a p2p network, or something else). As long as you cycle through all connections within m time and aren't saturating them, the algorithm can increase n to increase throughput.
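Something like this rough sketch, where conn_t/next_request/handle are stand-ins and the adaptation rule at the end is just one possibility, not a recommendation:

    #include <time.h>

    typedef struct conn conn_t;
    typedef struct request request_t;
    request_t *next_request(conn_t *c);
    void handle(conn_t *c, request_t *r);

    static long now_us(void) {
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);
        return ts.tv_sec * 1000000L + ts.tv_nsec / 1000L;
    }

    /* Serve up to *n requests from one connection, but never spend more
       than m_us microseconds on it; nudge n up or down afterwards. */
    void serve_connection(conn_t *c, int *n, long m_us) {
        long start = now_us();
        int done = 0;
        while (done < *n && now_us() - start < m_us) {
            request_t *req = next_request(c);
            if (req == NULL)
                break;                      /* nothing buffered, move on */
            handle(c, req);
            done++;
        }
        long spent = now_us() - start;
        if (done == *n && spent < m_us / 2)
            (*n)++;                         /* well under budget: batch more */
        else if (spent >= m_us && *n > 1)
            (*n)--;                         /* blew the budget: back off */
    }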
Not if you receive as much as possible at once (i.e., the size of the socket's receive buffer), and process the N PDUs at your leisure. (By "process at your leisure" I mean process the PDUs, of course, but also service other clients.)
While the server is "processing", the client can only send enough request data to fill the socket's receive buffer again (around 300K bytes on my machine over the loopback interface, and it can be adjusted if it's too large). When that buffer becomes full, TCP's flow control kicks in and prevents the client's TCP stack from sending any more data, which will cause the client's socket send buffer to fill up, at which point it will be prevented from sending any more request data (send() will block or keep returning EWOULDBLOCK/EAGAIN if using non-blocking operations).
When you've eventually processed all of the PDUs, call recv() again, draining the next 300K bytes. Then the client will be able to start sending request data again.
This avoids the resource-hogging problem you described and keeps the user mode:kernel mode ratio as high as possible, which is what the article was talking about.
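A sketch of that drain-then-parse loop, with pdu_length()/handle_pdu() standing in for whatever framing the protocol uses (a real server would keep one buffer per connection rather than a static one):

    #include <string.h>
    #include <sys/socket.h>

    /* pdu_length() returns the size of the first complete PDU in the
       buffer, or 0 if more bytes are needed. */
    size_t pdu_length(const char *buf, size_t avail);
    void handle_pdu(const char *pdu, size_t len);

    #define BUF_SIZE (256 * 1024)   /* in the ballpark of the receive buffer */

    void drain_and_process(int fd) {
        static char buf[BUF_SIZE];
        static size_t used;

        /* one recv() takes whatever the kernel has buffered; TCP flow
           control throttles the client while we work through it */
        ssize_t n = recv(fd, buf + used, sizeof(buf) - used, 0);
        if (n <= 0)
            return;                 /* EOF/error/EAGAIN handling omitted */
        used += (size_t)n;

        size_t off = 0, len;
        while ((len = pdu_length(buf + off, used - off)) > 0) {
            handle_pdu(buf + off, len);   /* other clients can be serviced between PDUs */
            off += len;
        }
        memmove(buf, buf + off, used - off);   /* keep any partial PDU */
        used -= off;
    }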
> Imagine a world where all system calls are asynchronous, all events and system calls return values are posted onto a message queue and you could drain the message queue with a single interaction with the kernel.
Interesting idea. Have existing OSes explored this? What are the major technical challenges and proposed solutions? Questions and ideas:
One option is to design every such API so that it can accept a batch of work. An array of structures representing the parameters to individual logical API calls. Invoke a single API call to hand off the array. Perhaps that will provide most of the achievable benefit, since I imagine these performance-intensive applications are bottlenecked on high volumes of calls to small numbers of APIs.
It would be a pain to modify every API interface in this fashion, and I imagine it will be difficult to implement those APIs.
Could the kernel provide a single batch facility accepting as input an array of structures representing API calls? The goal would be to make it possible to implement those APIs mostly like normal, such that the kernel takes responsibility for fanning out the batch into individual API invocations, in the simplest case by executing them sequentially; though the implementation of the API could also take advantage of batching as well.
Success of the calls would need to be determined through an API like select, or alternatively through IO events on the individual file descriptors being worked on. There will need to be a canonical format for specifying the API to invoke (function pointer?) and its arguments (calling convention). I imagine this could be quite challenging to implement in general; all state will need to be managed explicitly. For example, you could not use a calling convention where the caller is expected to use another API to get details about why the last call failed (like Windows `GetLastError`), since that state would be clobbered by subsequent calls in a batch.
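To make the shape concrete, here is a purely hypothetical descriptor for such a batch interface; nothing below is a real kernel API, and the per-entry result field exists precisely to avoid GetLastError-style shared state:

    #include <stdint.h>

    enum batch_op { BATCH_READ, BATCH_WRITE, BATCH_ACCEPT, BATCH_CLOSE };

    struct batch_entry {
        enum batch_op op;        /* which call to perform                  */
        int           fd;        /* target descriptor                      */
        void         *buf;       /* data pointer, if the op needs one      */
        uint64_t      len;       /* buffer length                          */
        int64_t       result;    /* filled in by the kernel: >=0 or -errno */
        uint64_t      user_tag;  /* lets the app match completions         */
    };

    /* Imagined entry point: hand over the whole array in one transition
       into the kernel; completions are later observed by polling a
       notification descriptor or re-reading the result fields. */
    long submit_batch(struct batch_entry *entries, unsigned count);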
In terms of a mechanism, could the OS and applications communicate through concurrent data structures like the disruptor pattern rather than by direct context switching? It seems like this would potentially require passing input/output through main memory (if crossing CPUs). Unsure if this would provide a net benefit.
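As a user-space illustration of the shared-structure idea, here is a single-producer/single-consumer ring of the kind the disruptor builds on; whether a kernel could safely share something like this with applications is exactly the hard part:

    #include <stdatomic.h>
    #include <stdbool.h>
    #include <stdint.h>

    #define RING_SIZE 256            /* power of two */

    struct ring {
        _Atomic uint32_t head;       /* advanced by the consumer */
        _Atomic uint32_t tail;       /* advanced by the producer */
        uint64_t slots[RING_SIZE];
    };

    bool ring_push(struct ring *r, uint64_t v) {
        uint32_t tail = atomic_load_explicit(&r->tail, memory_order_relaxed);
        uint32_t head = atomic_load_explicit(&r->head, memory_order_acquire);
        if (tail - head == RING_SIZE)
            return false;            /* full */
        r->slots[tail & (RING_SIZE - 1)] = v;
        atomic_store_explicit(&r->tail, tail + 1, memory_order_release);
        return true;                 /* no syscall on either side */
    }

    bool ring_pop(struct ring *r, uint64_t *out) {
        uint32_t head = atomic_load_explicit(&r->head, memory_order_relaxed);
        uint32_t tail = atomic_load_explicit(&r->tail, memory_order_acquire);
        if (tail == head)
            return false;            /* empty */
        *out = r->slots[head & (RING_SIZE - 1)];
        atomic_store_explicit(&r->head, head + 1, memory_order_release);
        return true;
    }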
From a naive perspective this all seems feasible, but also sounds like a huge amount of work and the primary benefit would be for the highest of high-performance applications. Though I wonder how much overall performance would improve for typical applications if we can minimize context switching.
Potentially relevant: https://news.ycombinator.com/item?id=7679822 - Linus Torvalds on the high cost of page faults, which are more likely to happen under frequent context switching.
Yes, this idea has been explored. See the paper FlexSC: Flexible System Call Scheduling with Exception-Less System Calls (Soares & Stumm 2010), https://www.usenix.org/legacy/events/osdi10/tech/full_papers.... In addition to having async system calls, they dedicate entire cores to processing system calls in the kernel, in order to minimize CPU cache thrashing. They've implemented this idea on Linux and sped up Apache by 116% (so 216% of the original performance), MySQL by 140%. Unfortunately it hasn't caught on in production anywhere.
Nitpick: they improved MySQL performance by 40% (so 140% of original performance), which is still very impressive.
It's also amazing that they were able to make it work transparently for applications. The only way it would catch on is if it were upstreamed though I guess.
> One option is to design every such API so that it can accept a batch of work.
There is some of that with scatter-gather APIs like the writev and readv system calls.
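For reference, the scatter-gather idea in miniature: one writev() call pushes a header and a body that live in separate buffers, with no gluing copy in user space first.

    #include <string.h>
    #include <sys/uio.h>

    ssize_t send_reply(int fd, const char *hdr, const char *body) {
        struct iovec iov[2];
        iov[0].iov_base = (void *)hdr;
        iov[0].iov_len  = strlen(hdr);
        iov[1].iov_base = (void *)body;
        iov[1].iov_len  = strlen(body);
        return writev(fd, iov, 2);   /* gather both buffers in one syscall */
    }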
AIO is there on Linux for disk access.
Then for sending data from a file to a socket there is the sendfile() system call, which kind of lets the kernel handle the copying.
> In terms of a mechanism, could the OS and applications communicate through concurrent data structures like the disruptor pattern rather than by direct context switching?
There are some already, mostly involving read-only data, implemented via pages mapped into processes' memory. These are called vDSOs; the gettimeofday() call is one such instance.
Another instance is packet capture, which often uses a ring of shared memory to get data out of the kernel.
Another way to do this, instead of re-creating all the APIs from scratch, is to optimize schedulers more. Make them cache (code and data) aware. In userland you can set the CPU affinity of your processes. Maybe match interrupt affinity with your CPU affinity to make sure data is present in the cache.
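For example, pinning the current thread with sched_setaffinity(); pointing the NIC's interrupt at the same CPU (via /proc/irq/<n>/smp_affinity) is a separate manual step not shown here.

    #define _GNU_SOURCE
    #include <sched.h>

    int pin_to_cpu(int cpu) {
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(cpu, &set);
        return sched_setaffinity(0, sizeof(set), &set);   /* 0 = calling thread */
    }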
I believe that BeOS (and now Haiku) explored some ideas like that. I think that all (most?) OS APIs were async and message queues were used to call the APIs as well as to return values from them (someone correct me if I'm wrong). This resulted in a really snappy GUI, but that's about all I know about it. One major hurdle was that programming for the OS became tricky.
There's this free book about BeOS which I've been meaning to read
http://oreilly.com/openbook/beosprog/book/
but never quite got around to it. You might find answers to some of your questions there.
>design every such API so that it can accept a batch of work
I've realized some significant performance gains with this approach. The benefit was locking once to process many messages, instead of locking for every message.
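The shape of it was roughly this (msg_t and process() are stand-ins, not the actual code): take the lock once, move the whole backlog out, release the lock, and do the work outside the critical section.

    #include <pthread.h>
    #include <string.h>

    typedef struct { char payload[64]; } msg_t;

    #define QCAP 1024
    static msg_t inbox[QCAP];
    static size_t inbox_count;
    static pthread_mutex_t inbox_lock = PTHREAD_MUTEX_INITIALIZER;

    void process(const msg_t *m);   /* application-specific work */

    void drain_inbox(void) {
        msg_t local[QCAP];
        size_t n;

        pthread_mutex_lock(&inbox_lock);          /* one lock ...       */
        n = inbox_count;
        memcpy(local, inbox, n * sizeof(msg_t));
        inbox_count = 0;
        pthread_mutex_unlock(&inbox_lock);        /* ... for n messages */

        for (size_t i = 0; i < n; i++)
            process(&local[i]);                   /* no lock held here  */
    }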
Wasn't win32 done like that? I have vague memories of PostMessage/GetMessage/TranslateMessage/DispatchMessage loops. I never did enough Windows development to understand where the actual syscall boundaries were, but since everything was done by message passing to the main loop of your application, the API seems amenable to pumping out multiple messages in one syscall.
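From memory the pump looked something like this (details may be off), with each GetMessage() able to block until something arrives:

    #include <windows.h>

    void message_pump(void) {
        MSG msg;
        while (GetMessage(&msg, NULL, 0, 0) > 0) {
            TranslateMessage(&msg);   /* e.g. turn key events into WM_CHAR  */
            DispatchMessage(&msg);    /* hand the message to the window proc */
        }
    }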
Windows's messaging API is indeed asynchronous, but they're high-level APIs built on top of synchronous system calls, and therefore do not improve performance as much as they could. For real asynchronous system calls, see the FlexSC paper I linked in my other comment.
It would be interesting to see, if/when kdbus catches on, whether it will result in more applications written in that manner. A userspace daemon talking asynchronously to a thread-pool executor that dispatches tasks and returns results.
Are they saying that threads and GC are crap, so to write predictable code one should think in terms of pre-allocated data [structures] and partitioned [block] I/O requests?)
This is big news..) As far as I remember, Informix Dynamic Server 7.30 was released 15 years ago..
Makes me suspicious. See http://mechanical-sympathy.blogspot.com/2013/02/cpu-cache-fl...