Nibble Sort

davidtgoldblatt · on July 21, 2015

Just to emphasize the truly remarkable thing here: the winning (SIMD) solution takes about 1ns per word sorted -- just over 3 machine cycles to sort 16 nibbles. Note that even extracting a nibble from a word in the naive manner would take about the same amount of time. The speedup from using SIMD registers is tremendous, and should be a lesson to anyone who spends their time micro-optimizing hot functions.

jcr · on July 21, 2015

The hanshq.net story is just one of three blog entries that John Regehr linked to yesterday:

http://blog.regehr.org/archives/1242

The other two are:

http://jbeekman.nl/blog/2015/01/nibble-sort/

http://belkadan.com/blog/2015/05/Nibblesort/

serkanyersen · on July 21, 2015

This kind of blog posts are super interesting and depressing at the same time :) makes me question my programing skills. Almost all of the things mentioned here are over my head.

Thanks for sharing though, very interesting

nly · on July 21, 2015

Sometimes programming feels like: "Interesting work, useful work. Pick one". People who get to do both are privileged indeed, the rest of us sit around feeling either bored or useless.

I'm thankful at least, in contrast to several friends who suffer from depression and are not prone to nerd sniping, that, for me, these two states are almost mutually exclusive.

stephengillie · on July 21, 2015

"May you live in boring times."

simi_ · on July 21, 2015

There's an ongoing Algorithms course on Coursera right now!

https://www.coursera.org/course/algs4partI

edit: found another one, also happening now (can't believe I missed it): https://www.coursera.org/course/algo

iso8859-1 · on July 21, 2015

I have a short Python script to generate sorting networks (and diagrams like the one found in the post): https://github.com/ysangkok/python-sorting-networks/blob/mas...

I was a bit disappointed to see no code in this blog post for the algorithms used to generating the pairs. Is there are templated C++ library for this?

hansw2000 · on July 21, 2015

I don't think the network in this case was generated by an algorithm. According to Knuth, this network was discovered by M. W. Green and it's still the smallest (60 comparators) 16-input network known.

userbinator · on July 21, 2015

Upon seeing the description of the problem, my first thought was "use a sorting network" too, although I'd probably try to use fewer registers...

ggreer · on July 21, 2015

For this problem, I wouldn't worry too much about register usage. The x86-64 ISA has 16 GP and 16 FP registers, but modern implementations use register renaming[1] to give many more under the hood. For example, Haswell has 168 GP and 168 FP entries in its register file.[2] This contest was purely about speed, so one should favor using as many registers as possible when making any time-memory tradeoffs.

1. https://en.wikipedia.org/wiki/Register_renaming

2. http://www.realworldtech.com/haswell-cpu/3/

amelius · on July 21, 2015

How does it scale?

ploxiln · on July 21, 2015

These solutions do not scale at all. Even with tweaks, they could not sort 18 nibbles (or 32 nibbles). But they are very interesting, and these kinds of tricks do find some practical use in things like video codecs.

dragontamer · on July 21, 2015

But they can sort 16 nibbles twice, and then you can use merge-sort between the two sets of nibbles.

Which would scale at O(n*log(n)). Which is asymptotically optimal IIRC.

signa11 · on July 21, 2015

> How does it scale?

what do you mean ? the task asks to sort 4bit nibbles in 64bit words, so you optimize for that problem space...