Hacker News
Speeding up independent binary searches by interleaving them (lemire.me)
96 points by ingve on Sept 14, 2019 | 23 comments



I was expecting this to be about fractional cascading which is a really clever trick: http://blog.ezyang.com/2012/03/you-could-have-invented-fract...


I am not sure why he is talking about branch prediction when this is really about caching. The next integer to look at is unlikely to be in the CPU data cache, so each time we have to go to main memory. If we interleave the searches, we interleave the memory requests and get higher throughput.

Or am I wrong?
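
(For context, the interleaving is roughly the following. This is my own sketch rather than Daniel's exact code, and it assumes N equally sized, non-empty sorted arrays with one key per array.)

    #include <cstddef>
    #include <cstdint>
    #include <vector>

    // Interleaved lower_bound over N independent, equally sized sorted arrays.
    // Each round of the outer loop advances every search by one step, so the
    // cache-missing loads for the different arrays can be in flight together.
    std::vector<size_t> interleaved_lower_bound(
        const std::vector<const uint32_t*>& arrays,
        const std::vector<uint32_t>& keys,
        size_t length) {
      const size_t nsearch = arrays.size();
      std::vector<const uint32_t*> base(arrays);  // current window start, per search
      size_t n = length;                          // current window length, shared
      while (n > 1) {
        const size_t half = n / 2;
        for (size_t i = 0; i < nsearch; ++i) {
          // One load per search per round; these are the requests that overlap.
          if (base[i][half] < keys[i]) base[i] += half;
        }
        n -= half;
      }
      std::vector<size_t> result(nsearch);
      for (size_t i = 0; i < nsearch; ++i) {
        result[i] = (base[i] - arrays[i]) + (base[i][0] < keys[i]);
      }
      return result;  // index of the first element >= key, for each array
    }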


He's talking about branch prediction because it's intimately tied to speculative execution. He's trying to explain why the standard 'branchy' version of binary search performs better than one might naively expect, and thus why the speedup from increasing MLP (memory-level parallelism) is smaller than one might expect.

When the processor reaches a conditional branch (the 'if' statement comparing the test value), rather than waiting for the actual results of the comparison to be known (and thus waiting for the test value to be loaded), it blindly guesses one of the branches and continues executing instructions as fast as it can. It doesn't commit the results of these instructions ('retire' the instructions) until the real answer is known, but crucially, it executes the loads sooner than they would otherwise be executed.

Most of these early loads are along the wrong path and the load is wasted, but half the time, the first guess is right, and half the time, the branch after that, etc. The end result is that branchy binary search ends up roughly twice as fast as a branchless version that waits for the loads to complete and efficiently makes only the loads that need to be made.
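
(Concretely, the 'branchy' version is just the textbook loop, something like this sketch; the 'if' is the branch being predicted, and the next iteration's load of arr[mid] gets issued speculatively before the current load has returned.)

    #include <cstddef>
    #include <cstdint>

    // The usual "branchy" binary search (lower bound). The processor predicts
    // the outcome of the comparison and speculatively runs ahead, issuing the
    // next probe's load before the current arr[mid] has arrived from memory.
    size_t branchy_search(const uint32_t* arr, size_t n, uint32_t key) {
      size_t lo = 0, hi = n;
      while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (arr[mid] < key)   // predicted first, verified once the load completes
          lo = mid + 1;
        else
          hi = mid;
      }
      return lo;  // first index whose element is >= key
    }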


If I had to hazard a guess, I’d say the conditional move means that both next values are prefetched into cache, whereas a branch means that fetch doesn’t happen until the comparison is made.


Unfortunately I think that's backwards on both sides.

Conditional moves create a data dependency, and thus do not involve speculative execution. While the program can explicitly try to preload both possible values (and can benefit from doing so) the processor doesn't ever do this automatically. Other instructions can still be executed (out-of-order) but the path of execution does not change.

With a branch, rather than waiting for the result of the comparison to be known, the processor guesses at the result and chooses a path of execution. It thus executes the fetch immediately, but (about) half the time it's the wrong fetch. If, after the original load completes, it turns out it guessed the wrong side of the branch, the speculative work is thrown away and execution resumes along the correct path.
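
(For contrast, a conditional-move style search looks something like the sketch below. This assumes the compiler actually turns the ternary into a cmov, which it usually but not always does, and uses __builtin_prefetch, the GCC/Clang builtin, for the explicit "preload both candidates" trick mentioned above.)

    #include <cstddef>
    #include <cstdint>

    // "Branchless" binary search: the next probe address is a data dependency
    // on the value just loaded, so there is no speculation down either half,
    // but also no wasted work on mispredicted paths.
    size_t branchless_search(const uint32_t* arr, size_t n, uint32_t key) {
      const uint32_t* base = arr;
      while (n > 1) {
        const size_t half = n / 2;
        // Explicitly prefetch both possible next probe locations (approximately);
        // the hardware won't speculatively fetch either of them for you here.
        __builtin_prefetch(base + half / 2);
        __builtin_prefetch(base + half + half / 2);
        base = (base[half] < key) ? base + half : base;  // data-dependent (cmov)
        n -= half;
      }
      return (base - arr) + (base[0] < key);
    }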


I think I’m missing something. Under what conditions do you want to search through several identically sized lists independently? (I had assumed from the title that this would be about performing many independent queries against the same list, optionally with SIMD, instead of one query against many lists).

It wouldn’t be a huge change to also make the N-value “per query”, but then you get a lot more branches in the inner loop. Even then though, I’m struggling to remember situations where I thought “I have a dozen lists where I need to do binary search for each of them”.


It definitely comes up in search-index style posting lists. For dense postings you'll usually deal with bitmaps, but you'll have sorted integers for sparse ones and you usually have at least one list per search term.

There are all sorts of database indexing problems that could use this too, I'm sure.


You could also use SIMD instructions, and there are bitwise operator tricks that might be able to mostly avoid branching.


I'll ask you an open question: do you need SIMD to get maximum speed on this problem? I think the answer is yes, but only because at a certain level of MLP you run out of architectural registers and your compiler starts spilling things to the stack. Daniel thinks (hopes?) the answer is no, and that with sufficient finesse you can get the same speed without SIMD.

The problem (well, at least very similar problems that we've thought about more) ends up running at very low IPC, with most of the time being spent waiting for data from RAM or L3. You can fit a lot of extra instructions in without seeing a slowdown. If there is a benefit to SIMD, both of us think it will be a fairly small effect, and only at very high levels of MLP. But we'd be excited to be proven wrong!


Shouldn't the CPU just switch to a different thread while it is waiting for a memory request to be fulfilled? So you could just run binary searches in different threads.


It's a good question, but I think that the problem is that the overhead of switching between threads is much greater than the number of cycles wasted by waiting for the memory request. I think the fastest context switches on Linux are around 1 µs (1/1000000 of a second, or about 4,000 cycles)[1], whereas a request from RAM is around 25 ns (1/40 of a µs, or about 100 cycles)[2].

It's possible my numbers are off by a bit, but I think this means that on a standard CPU you really need to parallelize the search more explicitly rather than letting the OS handle it for you. GPUs, on the other hand, do take a similar "brute force" approach to context switching, and clearly get good results on some similar problems.

[1] https://eli.thegreenplace.net/2018/measuring-context-switchi...

[2] https://www.anandtech.com/show/9483/intel-skylake-review-670...


This is what Hyper-threading[1] is designed to do.

The point is that it's expensive for the OS to switch threads, so the CPU instead tells the OS it can run two threads at the same time, and does so by sharing execution resources.

It gives a nice boost for programs that do a fair bit of computation while also having a fair share of cache misses. If you have tight loops with little memory access, you won't gain much.

[1]: https://en.wikipedia.org/wiki/Hyper-threading


Context switches are extremely expensive, no way would it make sense to do it every memory access.

Edit: https://eli.thegreenplace.net/2018/measuring-context-switchi... cites a couple microseconds, whereas memory latency is measured in tens of nanoseconds. So several dozen times more expensive to context switch.


I suspect the context switches between threads would dominate the cost here. The small amount of assembly code that performs a binary search sounds much cheaper than all the stack manipulation associated with preempting a thread.


As others have stated, thread switching overhead is much higher than even the cost of a main memory fetch, so we can't afford to switch a thread on every pipeline stall. Numbers Every Programmer Should Know: https://people.eecs.berkeley.edu/~rcs/research/interactive_l... (Although I see that this does not include thread switching latency, possibly because it depends on the OS and thread type. I'll leave the link because it's generally useful.)


This is actually exactly what is going on in this example. The CPU is able to execute instructions from the various searches in whatever order the data arrives.

You can learn more about the process here : https://en.wikipedia.org/wiki/Out-of-order_execution

Because this approach does not create full "threads," the CPU doesn't need to spend time doing things like saving registers or figuring out what to execute next.


This looks similar to a technique used in Cisco VPP to process batches of packets very efficiently; it's a cool way to cheat the pipeline.


What I don't understand, though, is why it wouldn't make sense to sort the keys being looked up first and then look them up in order while traversing the tree. That seems much better from a caching perspective.

EDIT: Ah, each look-up is in a different tree. That takes away that optimization.


Sounds on the surface like the ray bundles that are used in ray-tracing acceleration.


The article doesn't seem to consider real life practicality enough. For example, what speed up do you get with this, over caching the most frequently accessed million data items using a constant time lookup?

The answer, of course, is that it heavily depends on your problem space, data size, and data characteristics.

Caching is somewhat discussed, and I get that the topic is not data structures and algorithms; however, it seems impossible to say when this approach would be useful in general without more practical considerations.

How easy would it be to maintain or enhance this kind of code? Would you get emails like "...we have a few bugs in the 16-way interleaved multithreaded binary search code, who wants to quickly get those cleaned up..."?


> The article doesn't seem to consider real life practicality enough.

It considers it plenty for an article in the category of "neat trick/wrinkle of processor caches and pipelining that you may not have known or thought about, pushed to its limit." If you're actually doing binary searches on sets large enough that this would matter, some kind of hash-based solution is most often going to be preferable anyway, because of its much higher cache efficiency and better asymptotic performance. That's not this article's purpose, though. Not every article is a tutorial.


Who said it should be a tutorial? I agree it's not a how-to, and a tutorial is probably a contradiction for this topic.

That doesn't mean it can't discuss practicalities. Plenty of advanced papers, blog posts, documentation, experiments, whatever, discuss engineering realities because they feel like it.

The reality is that some things are less intuitive and require more thinking outside the problem than others.

If someone is writing only about how binary trees work, rather than combining them with concurrent programming as is done here, there's little point in considering applications, because the concept is useful to know and apply so often.

This one, however, is an edge case, and even when you need it, I suspect many seasoned engineers would still benefit from some contextual considerations from the author.


> How easy would it be to maintain or enhance this kind of code? Would you get emails like "...we have a few bugs in the 16-way interleaved multithreaded binary search code, who wants to quickly get those cleaned up..."?

Harder than the naive binary search, but that's a good thing, because the guy who maintains the library code to do these parallel searches likes hard problems in low level programming.



