Faster than radix sort: Kirkpatrick-Reisch sorting

vanderZwan · on June 9, 2020

So since this sorting algorithm involves a trie, would there another optimization possibility by using a data structure inspired by the MergedTrie?

My first thought would be to split the list of numbers into a prefix and a suffix part and building two tries connected at the leaves[1][2], replacing the trie used in the article. Then we sort both tries using the Kirkpatrick-Reisch method (but in reverse order for the suffix trie so that the final result is sorted correctly), and finally we would have to reconnect the two while walking the tries.

[0] https://journals.plos.org/plosone/article?id=10.1371/journal...

[1] more or less, the MT in the linked paper works a bit differently but also has a different use-case in mind.

[2] Also I have no idea if it make sense to have two depth 2 tries, or if there is another algorithm out there with two depth 1 tries that _kind_ of looks like this algorithm

vanderZwan · on June 10, 2020

So I tried working this out on paper. The simplest variation I could think of:

- split the numbers into a top and bottom half (from now on: prefix and suffix) (linear time)

- make an unordered suffix trie (linear time). First level has suffixes as, second level has prefixes

- make a (recursively sorted) ordered prefix set, and a (recursively sorted) ordered suffix set

- initiate an ordered prefix trie, but only the first level for now - that is, don't insert suffixes yet (linear time over the ordered prefix set)

- in order of the ordered suffix set, walk over the suffix trie and for each prefix leaf insert the parent suffix into the appropriate prefix bucket in the prefix trie (linear time)

- now we can walk the prefix trie in order and combine prefix and suffix again (like in the article)

This feels like it should have comparable computational complexity - as far as I can see the only real difference is that it recursively sorts twice as often (once for the prefix set and once for the suffix set). Either way it still seems to have horrible memory overhead, requiring a trie for each level of recursion and all that.

Then I realized that if we are at the base case where prefix/suffix can be sorted with a counting sort, then the above can actually be simplified to LSB radix sort where we sort the suffixes into a temporary secondary array, and the prefixes from the secondary array into the original array (I think we can safely say that using a plain array of n elements has both lower memory overhead and better computational performance than a trie with n leaves). But... couldn't I then optimize the entire recursion into an LSB radix sort? Which would imply it must have... worse time complexity than Kirkpatrick-Reisch sorting? Wait what? Where did I go wrong then?

olliej · on June 9, 2020

I suspect memory indirection would clobber the theoretical perf, but I'd be happy to be proved wrong.

My inclination is that this would be slower than "standard" high perf radix sorting, but I'm not sure if the high level overview of this algorithm represents an equivalent level of implementation.

oxxoxoxooo · on June 9, 2020

If you are into integer sorting, this might be of interest as well:

https://yourbasic.org/algorithms/fastest-sorting-algorithm/

https://sorting.cr.yp.to/

nathell · on June 9, 2020

Written by Tomek Czajka, a 3x TopCoder winner and algorithmic mastermind. Worth following!

mirekrusin · on June 9, 2020

Remember him at high school programming olympiads, top place year after year (also on math olympiads and likely other competitions I'd have to recall), everybody admired him.

nXqd · on June 9, 2020

And he's a spaceX engineer

1wd · on June 9, 2020

O(n+n * log(w/log(n)) )

Wouldn't this decrease again for large enough n, and even go negative after n=2^(w * 2)?

dan-robertson · on June 9, 2020

The algorithm is to switch to a counting sort when w <= log n, ie n >= 2^w, so more properly the complexity is written:

  O(n+max(0,log(w/log n)))

karpierz · on June 9, 2020

The recursion assumes that log(n) > w; if log(n) <= w, then you're in the base case and it's O(n).

rocqua · on June 9, 2020

Not sure whether it applies here, but _if_ n is the number of unique values, you are limited here by the fact that there are only 2^w unique integers. Hence n < 2^w

ben-schaaf · on June 9, 2020

Not sure what's going on here, but that does indeed seem to be the case: https://www.wolframalpha.com/input/?i=x%2Bx+*+log%282%2Flog%...

cwzwarich · on June 9, 2020

> Faster

Benchmarks?

beagle3 · on June 9, 2020

In the big-Oh algorithmic complexity sense; in a loose sense, for any pair of implementations (radix sort, kr sort) there exists a word size w and a list size n such that If either w or n increases, the time for radix sort would increase more quickly than for Kr sort - and this, eventually kr would be faster and keep getting faster. (Assuming that the hash can indeed yield average O(1) access, which is probabilistically but not deterministically true)

That said, word size w is, in almost all integer dieting problems, bounded by 128 (by 64 or even 32 with high probability) which makes it acceptable to regard as “constant” in which case both sorts are essentially O(n) and it all depends on specific implementations (with radix sort likely significantly faster in practice)

adrianN · on June 9, 2020

Big-O as commonly used in the CS literature sometimes doesn't translate to Big-O on actual computers. For example virtual memory translation can add a log term where you wouldn't expect it: https://pdfs.semanticscholar.org/1e90/c55362cf7793dc0b2521f6...

lorenzhs · on June 9, 2020

That's a common misunderstanding of what "Big-O" is, but rephrased as "the models used for algorithm analysis assume very simple machines that don't represent actual computers all that well" your point is very valid. There's a whole lot of things that our computers do that aren't accounted for in the RAM model (which is most commonly used when analysing sequential algorithms). Memory hierarchies are a big one (the external memory model accounts for a two-level hierarchy, and can be used to reason about cache usage or external memory ("out-of-core")), but other things such as branch (mis-)predictions or virtual memory translation (TLB) are rarely accounted for. That's what the field of Algorithm Engineering is about: designing algorithms that have good worst case guarantees ("Big-O") but that are also really fast when implemented on real-world hardware. (Given the publication list on your website, you probably know all this, but I wanted to expand on it for others)

Certhas · on June 9, 2020

There is the idea that you should treat memory access as an O(N^.5) operation:

https://github.com/emilk/ram_bench

I am not sure if any serious academic work has been built on this model, but it's a nice short hand.

gnramires · on June 9, 2020

This is somewhat universal. (Some physical insights)

Naively, to achieve optimal access time, you can pack your memory within a sphere of radius R, and R=O(N^(1/3)).

But, for large R you start having cooling problems. If each memory element needs some power P to operate, then the total power consumption is P×N = O(R^3). But your area is only 4pi R^2, so the power flow per unit area is O(R)=O(N^(1/3)). So if it has large radius, and it has limited thermal conductivity, your memory will melt (since temperature ~ power flow^(1/3) (Plank's law)).

The threshold for stable temperatures at any radius is memory access as O(N^(1/2)).

This analysis is valid for general computing and circuits, but since computers are usually modeled as memory machines I think that's sufficient (?).

Obs: Why, or how, is the human brain roughly spherical then? Because we have a very effective (water based) cooling system. Still, if it got large enough, and you admit limited flow rates of water and such, cooling eventually would be limiting. If you immediately thought of elephants, so did I, and this may be linked to their fantastic large ears:

https://asknature.org/strategy/large-ears-aid-cooling/

I love how everything is connected.

Obs2: Yes this is related to the Bekenstein bound, but much more relevant of course (because existing RAM is almost thermally limited and you need black hole densities to achieve bekenstein bound). The memories we use are organized in (mostly) flat packages.

Certhas · on June 10, 2020

The true spherical cow model of circuits.

I do wonder though if this is really the mechanism behind the observed N^.5 law. As you allude to with Bekenstein, just because there is an eventual physical limit doesn't mean the structure of real hardware mirrors it.

Also, we are not limited to dissipation to transport heat away...

lorenzhs · on June 9, 2020

Well the problem with that is "this is a curve that roughly fits the data" is a bad way to go about constructing a model. It's a useful and neat observation, but that doesn't make it a good model. It might model random accesses to memory reasonably well (such as traversing a linked list, the example used there), but it doesn't model a scan over an array well. That doesn't make for a useful model. In contrast, the external memory model assumes a fast internal memory of size M (e.g., cache), which can be accessed in constant time, and a slow external memory of infinite size (e.g., RAM) which can only be accessed in blocks of size B (e.g., a cache line). Then you count how many blocks the algorithm needs to read or write. Now, scanning an array of size N takes O(N/B) I/Os, whereas scanning a linked list of the same size takes O(N) I/Os. The complexity of sorting is O(N/B log(N/B)/log(M/B)) I/Os. This models the same behaviour in a much cleaner way, applies equally to all levels of the memory hierarchy (you can also view RAM as the internal memory and a hard disk/SSD as the external memory), and is widely used in what you called "serious academic work" :) See also https://en.wikipedia.org/wiki/External_memory_algorithm

Furthermore, the introduction in the article you linked misunderstands Big-O notation so incredibly fundamentally that I don't think the author has done their background reading on machine models and Big-O notation.

Certhas · on June 10, 2020

It is not obvious that a two level model is a cleaner way to think about todays memory access, which has 4-5 levels of caches before you even hit possibly NUMA RAM, then an SSD, then a HDD and then maybe big datasets that can only be accessed over the network.

But then, I am a physicist, not an engineer, so to me starting from empirical observations is actually a very good way to construct a model.

lorenzhs · on June 12, 2020

Well you can apply it to any pair of (adjacent) levels of the memory hierarchy. But the main problem with the square root model is that it only models random access time, but not when they are incurred and when data is already in cache. (There are also 2-3 levels of caches, no architecture that I’m aware of has more than 3, maybe 4 if you count the CPU registers but their allocation is usually fixed at compile time)

naasking · on June 9, 2020

> I am not sure if any serious academic work has been built on this model, but it's a nice short hand.

Not that nmodel specifically, but cache-oblivious data structures are specifically designed to scale well in a hierarchical cache model, no matter the cache block size. So they scale excellently across L1 cache all the way down to hard disk.

kurlberg · on June 9, 2020

An amusing, surprising and in hindsight obvious lower bound for average random access speed in an array containing N "words" is N^(1/3) (cube root of N.) I.e., for any realizable computer without new physics.

jnxx · on June 9, 2020

Yeah. I am wondering since a while whether well-known algorithms like the binary heap are still efficient on modern architectures, because their random memory access patterns.

lorenzhs · on June 9, 2020

For priority queues, which implementation is optimal depends a lot on your workload (do you need addressability, i.e., a decreaseKey operation? What is the typical ratio of insertions to deletions? Are your keys integers?). I found some slide decks that are mostly in English with some German in between that might go some way to answering your question:

[1] https://algo2.iti.kit.edu/sanders/courses/algen20/vorlesung.... slide 180-207, there are some up-to-date measurements on slide 204

[2] https://nms.kcl.ac.uk/stefan.edelkamp/lectures/ae/slide/AE-A...

There are also parallel priority queues which might be useful depending on your problem, especially if it can be reformulated to operate on batches ("give me the 20 smallest items", "insert these 50 items").

contravariant · on June 9, 2020

Look up cache-oblivious algorithms. Turns out that random memory access can often be improved upon with smart data structures.

the8472 · on June 9, 2020

The paper mentions huge tables but considers their use uncommon, that has changed since then, at least on linux.

vvanders · on June 9, 2020

Yeah, would be curious as well. There's two really awesome things about radix sort:

1. It scans in linear order, so if you tune your radix size to L1/L2 cache it will happily beat other "faster" algorithms thanks to the prefetcher.

2. If preserves ordering for keys with the same value.

#2 makes is a really good depth-sorting algorithm for alpha rendering, and #1 just makes it darn fast. There's a nice floating point implementation out there for it as well.

corysama · on June 9, 2020

vvanders, I believe you’ve worked in games so you might already know about how the PlayStation 1 kindof had radix sort baked into the hardware. The hardware had no Z buffer, so all polygons had to be ordered back-to-front using the Painter’s Algorithm for visibility. The hardware understood a linked list of polygons; as odd as that sounds. And, the standard practice presented by the API was to have a pre-allocated linear array of NOP list nodes forming a radix as the starting point for inserting sorted polys.

beagle3 · on June 9, 2020

> it will happily beat other "faster" algorithms

When it applies, there are essentially no faster algorithms - it’s O(n) if the word size is constant (it often is), which cannot be beat asymptotically. kr sort is only asymptotically better if word size is considered variable.

It’s irrelevant if you have no radix to sort on - comparison sort is provably at least O(n log n) which is slower.

moonchild · on June 9, 2020

> It preserves ordering for keys with the same value

That's called a stable sort, and it's a standard property present in many sorting algorithms.

zem · on June 9, 2020

faster in the algorithmic rather than the performance sense

rumanator · on June 9, 2020

That's not what "faster" means. Computational complexity means expected asymptotic behaviour followig certain assumptions. More often than not don't happen in the real world, or don't take in consideration real-world properties such as tiered cache and the impact of cache misses.

MaxBarraclough · on June 9, 2020

You're being downvoted, but it seems like a fair point.

My favourite example: multiplication of very large square matrices. The algorithms with the best time complexity are never used in the real world. Under no realistic circumstances would they offer the best performance.

In that case, it's not cache behaviour or branch-prediction that dooms the algorithm, but very intensive steps in the algorithm that are nonetheless constant-time, and so don't impact the complexity theoretic properties.

Do we describe such algorithms as the 'fastest'? I wouldn't. They have the lowest time complexity. Not the same thing.

https://en.wikipedia.org/wiki/Galactic_algorithm

lorenzhs · on June 9, 2020

Same as approximately nobody (who did their homework) uses Fibonacci heaps, even though their theoretical running time is fantastic. Turns out that using a binary heap is faster in practice (and binary heaps are far from optimal, see my sibling comment).

Models are there to enable us to reason about things that are too complex to reason about directly. They're always going to lose information. Sometimes using a different model is the right answer, and sometimes you need practical experiments to determine what actually works.

MaxBarraclough · on June 9, 2020

> nobody (who did their homework) uses Fibonacci heaps, even though their theoretical running time is fantastic

Sounds similar to the 'Brodal queue'.

elwes5 · on June 9, 2020

Big O gets you in the right ballpark of what to look at. That extra 'C' bit that gets left out can doom it to be worse than other items though.

For example for very small sets of numbers you could basically pre-sort every combination there is then have a very large lookup table. Much like a rainbow table for passwords. Your O is basically a binary search lookup O(log(n)) or even better O(1) if you can make it linear. However, the upfront cost is huge and storage cost is huge, lookup would probably be big too. Hmm, now that I think about it this could be an interesting thing to mess with. Also at this time anything past 4 bytes would be unusable. Hmm, maybe later. Like you point out a 'galactic alg'. Portions of that thinking can be pulled out and used for other items though.

With sorting using the comparison is a good proxy for if it might perform well. But it is just that, a proxy. Like the weird sort I just made up. There is probably a few shifts and muls in there. Those are decently expensive and can blow your perf right out of the water.

MaxBarraclough · on June 9, 2020

> Big O gets you in the right ballpark of what to look at

Generally I'd say that's true, but even that depends on context. For sorting very small arrays, on typical hardware, you can't beat bubble-sort and insertion-sort.

elwes5 · on June 10, 2020

Oh absolutely. The bubble sort sort thing usually comes down to the architecture of the machine. One of the things the O notation kind of hand waves away. On paper some things are faster. But put in 3 levels of cache, a CPU scheduler, a particular ASM instruction flow that makes things faster/slower and suddenly things are different. That is my biggest gripe with the notation. It is good to get you 'close'. But sometimes you just need to fiddle with it and try it. The 'C' bit can get you. On paper bubble sort is always worse. But it can run faster for small sets because the code and is small enough to fit into L1. Whereas maybe a mergesort implementation either the code or the data fits but not both.

cycomanic · on June 9, 2020

TIL that the fastest way to multiply two numbers is by using a 1729-dimensional Fourier transform. Thanks!

MaxBarraclough · on June 9, 2020

For a less impractical but still mind-bending algorithm, see Strassen's algorithm for large matrix multiplication.

https://www.geeksforgeeks.org/strassens-matrix-multiplicatio... , see also https://youtube.com/watch?v=ORrM-aSNZUs

andrepd · on June 9, 2020

Author knows that:

> To achieve agreement when discussing the speed of a sorting algorithm, it’s necessary to establish some common ground. How do we measure speed? There is a well-established model – the von Neumann computer, or unit-cost RAM model – widely used by computer scientists to evaluate algorithms without introducing all the complexities of real-world computers. This model doesn’t account for some important things, such as the performance gap between RAM and CPU, but typically gives good approximations of real-life performance.

zem · on June 9, 2020

it's just a useful and commonly-understood shorthand for "has better asymptotic complexity"

xiaodai · on June 9, 2020

I might be missing something but radix sort I can sort a 64 bit vector 11 bits at a time.