Note for those who only read the headline: the 3x speedup came from removing a call to RDRAND in the extraction path, not from the change in hash function.
The change in hash function also brought speed improvements, though. Quoting the commit messages:
Change to BLAKE2s:
> This also has the advantage of supplying 16 bytes at a time rather than SHA1's 10 bytes, which, in addition to having a faster compression function to begin with, means faster extraction in general. On an Intel i7-11850H, this commit makes initial seeding around 131% faster.
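To make the "16 bytes at a time rather than 10" point concrete, here is a rough sketch of the arithmetic. The 32-byte request size and the `calls_needed` helper are my own assumptions for illustration, not anything taken from the kernel source:

```python
def calls_needed(request_bytes: int, bytes_per_extraction: int) -> int:
    """Number of extraction rounds needed to cover a request (ceiling division)."""
    return -(-request_bytes // bytes_per_extraction)

if __name__ == "__main__":
    request = 32  # hypothetical request size, chosen only for illustration
    print("SHA-1   (10 bytes per extraction):", calls_needed(request, 10))
    print("BLAKE2s (16 bytes per extraction):", calls_needed(request, 16))
```

Under that assumption the same request takes half as many extraction rounds, before counting any difference in the compression function itself.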
RDRAND call removal:
> Removing the call there improves performance on an i7-11850H by 370%. In other words, the vast majority of the work done by extract_crng() prior to this commit was devoted to fetching 32 bits of RDRAND.
For such small message sizes (416 bytes, AFAICT), it's plausible that refactor-induced compiler optimizations are responsible for much of the gains over SHA-1. Note that the old code invokes sha1_transform 16 times (assuming 104 32-bit pool words), whereas the new code calls into the blake2s library twice: blake2s_update and then blake2s_final. That suggests the new code is likely spending much more time in tight, inlined loops, as I doubt the hash library entry points are being inlined into that driver.
Most of the performance metrics people cite when discussing hash functions are for large messages. For small message sizes, hash initialization and finalization costs tend to dominate, and costs one might otherwise ignore, such as function calls, can become noteworthy.
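If you want to see the small-versus-large message effect for yourself, a minimal userspace sketch follows. It uses Python's hashlib, so it measures whatever implementations your Python build ships plus interpreter overhead, not the kernel code discussed here; the 1 MiB "large" size, the repeat counts, and the function names are all my own choices:

```python
import hashlib
import timeit

def ns_per_byte(name: str, size: int, repeats: int) -> float:
    """Average cost in ns/byte of one full init + update + digest cycle."""
    data = bytes(size)  # all-zero message; content does not affect timing here

    def run() -> None:
        h = hashlib.new(name)  # initialization
        h.update(data)         # compression over the message
        h.digest()             # finalization

    seconds = timeit.timeit(run, number=repeats)
    return seconds / repeats / size * 1e9

def compare(sizes=(416, 1 << 20)) -> dict:
    """Per-byte cost for SHA-1 and BLAKE2s at each message size."""
    results = {}
    for name in ("sha1", "blake2s"):
        for size in sizes:
            # Fewer repeats for big messages so the run stays short.
            repeats = 2000 if size < 4096 else 20
            results[(name, size)] = ns_per_byte(name, size, repeats)
    return results

if __name__ == "__main__":
    for (name, size), cost in compare().items():
        print(f"{name:8s} {size:8d} bytes: {cost:8.2f} ns/byte")
```

The absolute numbers will vary by machine and Python build. The thing to look at is how the per-byte cost at 416 bytes compares with the per-byte cost at 1 MiB for the same hash.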