Just to emphasize the truly remarkable thing here: the winning (SIMD) solution takes about 1ns per word sorted -- just over 3 machine cycles to sort 16 nibbles. Note that even extracting a nibble from a word in the naive manner would take about the same amount of time. The speedup from using SIMD registers is tremendous, and should be a lesson to anyone who spends their time micro-optimizing hot functions.
This kind of blog posts are super interesting and depressing at the same time :) makes me question my programing skills. Almost all of the things mentioned here are over my head.
Sometimes programming feels like: "Interesting work, useful work. Pick one". People who get to do both are privileged indeed, the rest of us sit around feeling either bored or useless.
I'm thankful at least, in contrast to several friends who suffer from depression and are not prone to nerd sniping, that, for me, these two states are almost mutually exclusive.
I was a bit disappointed to see no code in this blog post for the algorithms used to generating the pairs. Is there are templated C++ library for this?
I don't think the network in this case was generated by an algorithm. According to Knuth, this network was discovered by M. W. Green and it's still the smallest (60 comparators) 16-input network known.
For this problem, I wouldn't worry too much about register usage. The x86-64 ISA has 16 GP and 16 FP registers, but modern implementations use register renaming[1] to give many more under the hood. For example, Haswell has 168 GP and 168 FP entries in its register file.[2] This contest was purely about speed, so one should favor using as many registers as possible when making any time-memory tradeoffs.
These solutions do not scale at all. Even with tweaks, they could not sort 18 nibbles (or 32 nibbles). But they are very interesting, and these kinds of tricks do find some practical use in things like video codecs.