Note for those who only read the headline: the 3x speedup was achieved by removi...

str4d · on Dec 31, 2021

The change in hash function did also result in speed improvements. Quoting the commit messages:

Change to BLAKE2s:

> This also has the advantage of supplying 16 bytes at a time rather than SHA1's 10 bytes, which, in addition to having a faster compression function to begin with, means faster extraction in general. On an Intel i7-11850H, this commit makes initial seeding around 131% faster.

RDRAND call removal:

> Removing the call there improves performance on an i7-11850H by 370%. In other words, the vast majority of the work done by extract_crng() prior to this commit was devoted to fetching 32 bits of RDRAND.

wahern · on Dec 31, 2021

For such small message sizes (416 bytes, AFAICT), it's plausible that refactor-induced compiler optimizations are responsible for much of the gains over SHA-1. Note that the old code invokes (assuming 104 32-bit pool words) sha1_transform 16 times, whereas the new code calls into the blake2s library twice: blake2s_update and then blake2s_final. That suggests the new code is likely spending much more time in tight, inlined loops, as I doubt the hash library entry points are being inlined into that driver.

Most of the performance metrics people cite when discussing hash functions are for large messages. For small message sizes, hash initialization and finalization costs tend to dominate, and costs one might otherwise ignore, such as function calls, can become noteworthy.
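
A rough way to see the small-message effect for yourself is a user-space micro-benchmark. The sketch below is only an illustration, not the kernel code: it uses Python's `hashlib`, a 416-byte all-zero message, and hypothetical helper names (`one_shot`, `chunked`, `bench`). Interpreter overhead is far larger than in C, so the absolute numbers say nothing about the kernel; the point is just that feeding the same bytes through many small calls costs measurably more than one call.

```python
import hashlib
import timeit

# 104 32-bit words = 416 bytes, the pool size assumed in the comment above.
MESSAGE = bytes(416)


def one_shot(name, data=MESSAGE):
    """Hash the whole message with a single call into the library."""
    return hashlib.new(name, data).digest()


def chunked(name, data=MESSAGE, chunk=64):
    """Hash the same message through many small update() calls."""
    h = hashlib.new(name)
    for i in range(0, len(data), chunk):
        h.update(data[i:i + chunk])
    return h.digest()


def bench(fn, name, number=5000):
    """Return the best observed time per call, in seconds."""
    best = min(timeit.repeat(lambda: fn(name), number=number, repeat=5))
    return best / number


if __name__ == "__main__":
    for name in ("sha1", "blake2s"):
        t_one = bench(one_shot, name)
        t_many = bench(chunked, name)
        print(f"{name:8s} one-shot {t_one * 1e9:8.0f} ns   "
              f"chunked {t_many * 1e9:8.0f} ns")
```

Both paths produce the same digest for a given algorithm, so any timing difference comes from call overhead rather than from hashing different data. Results will vary by machine and Python build.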

xarope · on Dec 31, 2021

"Perfection is achieved, not when there is nothing more to add, but when there is nothing left to take away."