Hacker News new | past | comments | ask | show | jobs | submit login
Linux Kernel RNG is now Blake2 instead of SHA1 and 3x faster (twitter.com/filosottile)
506 points by tptacek on Dec 30, 2021 | hide | past | favorite | 141 comments



This is an egregiously editorialized headline, but this was the easiest way to submit two separate links to LKML in one HN submission --- link to Fillipo's tweet about those links.

The @zx2c4 Filippo is referring to is Jason Donenfeld, of WireGuard fame. It was news to me that he'd taken over the Linux kernel random driver! But it's welcome news indeed.


I wonder if having a faster RNG will help WireGuard (in addition to many other things)


No idea, but WireGuard mostly just uses get_random_bytes in the handshake, not per-packet.


The popularity of the Blake family annoys me, as I feel that it's overhyped. Sure, they're fast (on current hardware), but they're supposed to be cryptographic hash functions. I know that BLAKE received some scrutiny in this regard, as part of the SHA3 competition. But then BLAKE was obsoleted by BLAKE2, which in turn was obsoleted by BLAKE3, making the claims to security less and less sensible.

And all this is before the actual winner of SHA3 comes into the picture. With parameters as specified by SHA3 it's a lot slower than BLAKE3, but it is my understanding (IANAC) that

(1) The SHA3 security margin can safely be decreased, as NIST set the security requirements too high. This is one of the reasons for the speed difference.

(2) The sponge-based crypto popularized by Keccak (the SHA3 winner) is much more amenable to cryptanalysis than the more traditional crypto such as the functions in the Blake family. This means that Blake's security would be less sure even had it received the same level of scrutiny. But AFAIK Blake received much less scrutiny than Keccak, or even other sponge-based functions, such as the ones in the NIST Lightweight Crypto competition.

https://csrc.nist.gov/Projects/lightweight-cryptography

https://keccak.team/2017/not_arx.html


> Sure, they're fast (on current hardware), but they're supposed to be cryptographic hash functions.

This is subjective, but in my mind the question of "how important is it for crypto to be fast?" is closely related to the question of "how many crypto options should we give people?" Like, should we design "slower, maximum-security, military-grade crypto for top secret material" and "faster, cheaper, less secure crypto for less sensitive stuff"?

I think we've all learned our lesson over the past few decades of watching real applications try to deploy crypto: Options are dangerous. This stuff is complicated enough as it is, and asking callers to choose between a bunch of alphabet soup standards that they don't understand makes things substantially worse. (AES-GCM or AES-ECB?!) It's important to be able to give clear, simple advice: Just use X.

So while it's true that most applications aren't bottlenecked by cryptography, pushing the margins of "most" benefits everyone, by putting more weight behind common standards that we know are safe.


If you want to minimize options, and yes, I know corporations with “you will use only these cryptographic algorithms” lists, I would go with SHA-2-256 for cryptographic hashing, [1] AES-192 for encryption, and RSA-2048 for public key. I would phase out RSA-2048 with whatever NIST chooses as their post-Quantium algorithm once that process ends [2], and use SipHash for the special case of choosing which hash bucket to use in hash tables. [3] I would maybe have SHA-3 on standby just in case SHA-2 ever gets broken.

[1] SHA-512/256 protects us from key extension attacks, but has much less deployment than SHA-2. We can mandate using SHA-2-256 in HMAC mode when we use it with a secret the attacker does not know.

[2] https://csrc.nist.gov/projects/post-quantum-cryptography/wor...

[3] SHA-2 is called a “hash”, as is “SipHash”, but they are different use cases. SHA-2 is generally for having a long hash where generally both the user and attacker know how to make a given string have a given hash. “SipHash” is to be used for the compression function for a hash table [4] where performance is critical, the hash is shorter, but it is protected by a secret key.

[4] https://en.wikipedia.org/wiki/Hash_function

[5] If you ask me what my favorite cryptographic hash and stream cipher is, I would say RadioGatún[32]. It has roughly BLAKE2’s speed with the added benefit of enhanced hardware performance. It was the only Extendable-Output Function (XOF) which existed back in 2007—the last time I revised which crypto primitives my own open-source software uses—and remains unbroken as we enter 2022. It know it’s hardly mainstream, but it is Keccak’s direct predecessor, and it has been analyzed by world renowned cryptographers.


You posted what you would use but did not justify it. Why not ECC? Why not AES256? Why not SHA512? Is truncating it that hard? What mode of operation?

People are very opinioned about cryptographic algorithms but more often than not they tend to be kinda clueless (not saying that this is the case here).


I chose, with the exception of SipHash, FIPS approved algorithms. As per your questions:

ECC: There was a lot of controversy because one of the FIPS (Edit: NIST) curves apparently has a back door (Edit, it was Dual_EC_DRBG), so just stick with RSA until we have standardized post-quantum public key cryptography.

AES192: If I had to choose one key size, AES192 comes off as the best compromise between speed and security; larger keys result in more AES rounds. Also: AES-256 is weaker with related key attacks than AES-192, and AES-192 is approved for Top Secret (not just Secret) information, while AES-128 isn’t.

SHA512: It uses 64-bit addition operations, which is less than optimal on certain low end embedded processors. Same reason BLAKE3 dropped the 64-bit variant BLAKE and BLAKE2 have; but I will let @oconnor663 speak on why they dropped the 64-bit version.

For a block cipher mode of operation, I would probably choose GCM, since it looks secure and is approved over at https://csrc.nist.gov/projects/block-cipher-techniques/bcm (ECB is still approved, strangely enough).


> I will let @oconnor663 speak on why they dropped the 64-bit version.

The most important thing to us was that there should be just one version. It's simpler for everyone. But we actually did go back and forth about which version it would be, 64-bit or 32-bit. "PCs and servers will all be 64-bit for the foreseeable future" was definitely a big point in favor of choosing 64-bit words. Section 7.2 of our paper goes into detail about why we went with 32-bit anyway.


Why go with FIPS approved algorithms though? Why not curve25519 which is easier to implement in a safe way compared to rsa or other curves?

"If I had to choose one key size, AES192 comes off as the best compromise between speed and security"

Did you test the speed differences? Especially when aesni is involved.


DJB’s algorithms are good. They haven’t been FIPS approved, mind you (EDIT: Curve25519, however, has been NIST approved [1]) but they are good.

If I were to make a short list of approved algorithms, I stick with FIPS, not because they are necessarily the best, but because they are standardized. Non-standard algorithms have their place—I use RadioGatún[32], of all things, for a cryptographic secure PRNG as well as for a cryptographic strong password generator—but not in a corporate context of making a “use only these algorithms” list so people don’t make novice mistakes like using MD5 to encrypt passwords or a block counter in ECB mode.

I mean, yeah, DNScurve would had been better than the DNS-over-HTTPS which won instead, but standards sometimes win simply because the standardization process for what won is perceived as being more cooperative.

As an aside, I actually like DJB’s software. I used Qmail and djbdns back in the day, and in fact I currently maintain a branch of djbdns which is, as far as I know, the only djbdns fork which is actively maintained here in the 2020s: https://github.com/samboy/ndjbdns

[1] https://csrc.nist.gov/News/2017/Transition-Plans-for-Key-Est...


So you are fine with curve25519 and chacha20, just not in "professional" tools like wireguard and tls, right?


I didn’t say that. Look, I really do have a lot of respect for DJB and his skills as both a coder and cryptographic algorithm designer. If I didn’t, I wouldn’t be using SipHash in one of my open source projects [1] nor would I be maintaining a fork of djbdns. [2]

Let’s just breathe in deeply and calm down. This is not a personal attack directed at anyone here but merely a note that emotions seem to be flairing up and I don’t think it serves anyone’s interests for this to become a Reddit or Twitter style flame war.

In cases where people know the risks, have read Applied Cryptography cover to cover, know why not to use MD5 or ECB in production code, etc. DJB algorithms are just fine. That’s the case with Wireshark and that’s the case with TLS.

What I am saying is this: In a corporate context, where you have programmers who would otherwise make novice mistakes like use simple MD5 to encrypt passwords—I’ve seen that in the real world, for the record—I would put mainly FIPS approved algorithms on a short list.

[1] https://github.com/samboy/lunacy

[2] https://github.com/samboy/ndjbdns

[3] Solar Designer did use MD5 to encrypt passwords for Wordpress about 15 years ago, but he did it in a way to minimize security risks (still secure today, though Argon2 or bcrypt are better), but that was in an era when the only cryptographic primitive PHP had was the now-broken MD5.


You think "Have read Applied Cryptography cover to cover" is a qualification for cryptography engineering? You get that there are people that actually do cryptography engineering professionally, like, on this thread and stuff, right?

It's OK to not really know all of this stuff. Other people can know it for you. The question mark comes in really handy in situations like this. It's not a challenge where you start with the distinction between FIPS and NIST and then axiomatically derive all of modern cryptography.


> Let’s just breathe in deeply and calm down.

Pretty sure they were just asking a small clarifying question.


> Also: AES-256 is weaker with related key attacks than AES-192, and AES-192 is approved for Top Secret (not just Secret) information, while AES-128 isn’t.

NSA CNSA says to use 256:

* https://en.wikipedia.org/wiki/Commercial_National_Security_A...

* https://www.keylength.com/en/6/

> […] so just stick with RSA until we have standardized post-quantum public key cryptography.

The folks that will approve the post-quantum stuff were also the ones the approved the 'pre-quantum' stuff, including the supposedly worrisome ECC standards. If you're not going to trust them now, why should you trust them in the future?


>The folks that will approve the post-quantum stuff were also the ones the approved the 'pre-quantum' stuff, including the supposedly worrisome ECC standards.

There’s a difference between something being developed by a third party and rubber stamped and made a FIPS by NIST (e.g. Rijndael which became AES or Keccak which became SHA-3) and something developed internally by NIST. The post quantum stuff is being developed outside of NIST, so there isn’t really a place for NIST/NSA to put a backdoor like they did with Dual_EC_DRBG.

SHA-2 was in fact developed behind closed doors by NIST/NSA, but since the magic constants are square roots and cube roots of primes, and since it’s been around and analyzed for over two decades, it’s extremely unlikely to have a backdoor.


> There’s a difference between something being developed by a third party and rubber stamped and made a FIPS by NIST

DES was developed by a third party (IBM, IIRC) and it was altered by NIST/NSA before its final form was approved. Turns out it was altered to maker it stronger against an attack no one publicly knew about at the time.

Things could potentially be approved that are mostly strong, except against an attack(s) that are only known in classified circles.


Ok, well, they're NIST curves, not FIPS curves, and none of them are backdoored, and it's GCM, not GSM, but do go on.


I stand corrected. It was Dual_EC_DRBG which had the backdoor. Nevertheless, it was a backdoor in a NIST algorithm using elliptic curves, so I think being a bit reluctant to use a NIST ECC curve is understandable, especially with RSA being tried and true.

GSM instead of GCM was a typo, and I believe it’s against the Ycombinator guidelines to make a correction like that a personal attack, and I believe we’re getting really close to crossing that line, so let’s just calm down and chill out.

This isn’t in any way a personal attack—I’m familiar with and have a lot of respect for your security auditing skills—but just an observation things are getting kind of heated here and it serves no one’s interests to have an out of control flame war.


So, on the one hand, you won't use NIST-approved elliptic curves because a NIST RNG used elliptic curves; on the other hand, you'd use GCM because... NIST approves it.


It’s funny that you’re a proponent of NIST/FIPS approved crypto but also worry about NSA backdoors.


The conflict of interest inherent to NSA has been a thing since data security first was a thing.

It is very rational to follow their guidelines, while at the same time being wary of backdoored products. You could have done a lot worse than picking AES during the past two decades.


If we, back in 1982, encrypted a document using triple-key 3DES, proposed in 1981, using CTR mode (1979), with a unique random key, storing the key using RSA-1024 (1977), that document would still be secure today. Even though DES came from NIST and the NSA.

Even if it’s a large document. Sweet32 really only works for CFB mode; while there are Sweet32-based attacks against CTR mode, they are a lot more difficult and often times not practical. I’m, of course, not advocating using 3DES, Blowfish, IDEA, or any other 64-bit block size cipher in a new system in 2022, but even ancient 3DES documents would still be secure today in many use cases.


CTR mode with a 8-byte block leaves so little room for a counter that, unless you use a fresh key for every encryption, you can plausibly wrap the counter.


Just for whatever it's worth, I think you're replying to one of the Blake3 authors?


Indeed. Not putting BLAKE on that list is not a criticism of BLAKE; it’s fast (especially in software), it doesn’t get faster in dedicated hardware (which, in some cases, is a good thing), and I believe it’s secure.

But, it hasn’t been standardized by NIST and consequently doesn’t have a FIPS standard. The only non-standard algorithm I put on that “use only this crypto” list is SipHash, simply because no one has ever bothered to make a FIPS standard for a fast and secure keyed hash bucket chooser.

I also like how BLAKE3 just said “we are only using 32-bit operations”, because there’s a lot of embedded places where 64-bit really isn’t an option, and because SIMD ISAs have fast 32-bit operation support on 64-bit architectures.


It's length extension attacks, not "key extension attacks". And SHA-512/256 (as you've named it here) is SHA-2.

The author of the comment you're replying to certainly knows what SipHash is.


On the other hand if crypto is "too slow" applications will more often decide to use a non-cryptographic primative instead. If it turns out that they actually did need cryptographic security, or something later starts relying on some "random" value that turns out to not be securely random then you end up with a vulnerability.

If you make cryptography "fast enough" to be used by default and without much worry than you remove the need to make this decision.

This use case is a perfect example. If you make the kernel RNG fast enough that people just use it instead of trying to make their own decision about what algorithm to use you end up with software that is more secure overall.

Plus since the kernel RNG isn't specified to use a specific algorithm so it could potentially be tuned based on your paranoia level and patched if the current algorithms are found to be insecure. If it is "too slow" less software use it and these features will be less useful.


As I understand it, the BLAKE family is built on a tweaked version of the ChaCha cipher, and security is proven under the assumption that this primitive behaves like a random function. Arguably we shouldn't be too concerned that blake3 is new, since ChaCha/Salsa are old primitives whose security has held up well.

While the sponge framework is popular, it's rather inflexible, e.g. its sequential structure permits limited parallelism. There are more general sound hashing frameworks that still offer provable security, such as this one which blake3's analysis is based on: https://www.cs.ru.nl/~bmennink/pubs/18fsenew.pdf


With parameters as specified by SHA3 it's a lot slower than BLAKE3

Keccak (SHA-3) is actually a good deal faster than BLAKE(1) in hardware. That’s the reason why they chose it: It has acceptable performance in software, and very good performance in hardware.

KangarooTwelve / MarsupilamiFourteen are Keccak variants with fewer rounds; they should smoke BLAKE2 and probably even BLAKE3 in dedicated hardware. Also, they have tree hashing modes of operation like the later BLAKE developers.

The BLAKE family is best in situations where you want the best possible software performance; indeed, there are cases where you do not want hardware to outperform software (e.g. key derivation functions) where some Salsa20/ChaCha20/BLAKE variant makes the most sense. The Keccak family is when one already has dedicated hardware instructions (e.g. ARM already has a hardware level Keccak engine; Intel is dragging their feet but it is only a matter of time) or is willing to trade software performance for more hardware performance.

Keccak code is here: https://github.com/XKCP/XKCP


> they should smoke BLAKE2 and probably even BLAKE3 in dedicated hardware

That's certainly possible, but there are subtleties to watch out for. To really take advantage of the tree structure in K12, you need a vectorized implementation of the Keccak permutation. For comparison, there are vectorized implementations of the AES block cipher, and these are very useful for optimizing AES-CTR. This ends up being one of the strengths of CTR mode compared to some other modes like CBC, which can't process multiple blocks in parallel, at least not for a single input.

So one subtlety we have to think about, is that the sponge construction inside SHA-3 looks more like CBC mode than like CTR mode. The blocks form a chain. And that means that a vectorized implementation of Keccak can't benefit SHA-3 itself, again at least not for hashing a single input. So if this is going to be provided in hardware, it will have to be specifically with other constructions like K12 in mind. That could happen, but it might be harder to justify the cost in chip area. (At this point I'm out of my depth. I have no idea what Intel or ARM are planning. Or maybe vectorized hardware implementations of Keccak already exist and I'm just writing nonsense.)


Is the SHA-3 hardware support already available in existing and widely used hardware?


Last time I looked, it’s available for ARM but there aren’t Intel ISA SHA-3 instructions yet.


A) I am not sure I'd say blake2 obsoletes blake, and certainly blake3 does not obsolete b2. They are a family, but still different algorithms.

B) blake3, afaik, is not yet as thoroughly vetted as b2.

C) I think some of the hype around b3 is deserved. It's crazy fast. If you don't care about crypto at all (e.g. file dedup), it's absolutely phenomenal. But the main downside of B3 right now is just novelty. Cryto algos are like fine wines.

D) I don't know enough about Keccak vs Blake, but just because Blake2 has received less scrutiny, doesn't mean it's insufficient. I'm sure there are tradeoffs that the kernel editors have chosen.

E: here's a link to some discussion:

https://crypto.stackexchange.com/questions/31674/what-advant...


Blake2 is already in lib/crypto in the kernel (it's needed for WireGuard, by the same author as this change); SHA3 and Blake3 are not. It's probably that simple.


BLAKE3 is crazy fast...on x86, with extremely performant implementations using SIMD instructions present.

On other platforms, where people haven't written such (or, IME, even on ARM when using their NEON version), it's nothing to write home about.


> Sure, they're fast (on current hardware), but they're supposed to be cryptographic hash functions.

Performance has nothing to do with whether a hash function is "cryptographic"

> I know that BLAKE received some scrutiny in this regard

BLAKE was one of the most, if not _the_ most, analyzed SHA3 entries.

"Keccak received a significant amount of cryptanalysis, although not quite the depth of analysis applied to BLAKE, Grøstl, or Skein"

https://nvlpubs.nist.gov/nistpubs/ir/2012/NIST.IR.7896.pdf

> making the claims to security less and less sensible.

BLAKE2/3 are not completely new algorithms. They fine tune features from the previous version to be more lightweight while still remaining cryptographically sound

> The SHA3 security margin can safely be decreased, as NIST set the security requirements too high. This is one of the reasons for the speed difference.

BLAKE and Keccak have roughly equal security margins, which is one reason why BLAKE2/3 can still be secure while being faster than BLAKE.

> This means that Blake's security would be less sure even had it received the same level of scrutiny.

So since it arguably received more, does that balance things out?

Addition being slower in hardware and much more expensive to mask against side channels is a serious argument against ARX constructions, but that isn't really relevant here


Note for those who only read the headline: the 3x speedup was achieved by removing a call to RDRAND in the extraction path, not the change in hash function.


The change in hash function did also result in speed improvements. Quoting the commit messages:

Change to BLAKE2s:

> This also has the advantage of supplying 16 bytes at a time rather than SHA1's 10 bytes, which, in addition to having a faster compression function to begin with, means faster extraction in general. On an Intel i7-11850H, this commit makes initial seeding around 131% faster.

RDRAND call removal:

> Removing the call there improves performance on an i7-11850H by 370%. In other words, the vast majority of the work done by extract_crng() prior to this commit was devoted to fetching 32 bits of RDRAND.


For such small message sizes (416 bytes, AFAICT), it's plausible that refactor-induced compiler optimizations are responsible for much of the gains over SHA-1. Note that the old code invokes (assuming 104 32-bit pool words) sha1_transform 16 times, whereas the new code calls into the blake2s library twice--blake2s_update and then blake2s_final. That suggests the new code is likely spending much more time in tight, inlined loops as I doubt the hash library entry points are being inlined into that driver.

Most of the performance metrics people cite when discussing hash functions are for large messages. For small message sizes hash initialization and finalization costs tend to dominate, and costs one might otherwise ignore, such as function calls, can become noteworthy.


"Perfection is achieved, not when there is nothing more to add, but when there is nothing left to take away."


The commit message criticizes SHA1 pretty well but why did they specifically choose Blake2? Donenfeld is clearly very smart but did they have a bake off?

https://lwn.net/Articles/879391/


They're already using Blake2 in the kernel and ChaCha in the LKRNG itself, and Blake2 and ChaCha share components.


When choosing a cryptographic algorithm, there are generally two schools of thought:

• Use something FIPS approved and tried and true (RSA, SHA-256, AES in GCM, maybe a NIST/FIPS curve, although Google “Dual_EC_DRBG” first, etc.)

• Use a cryptographic primitive developed by Daniel J. Bernstein (Curve25519—which, yes, is NIST if maybe not FIPS approved, [1] ChaCha20 and its derivative BLAKE2, etc.)

These are two camps, and there is some tension between the two camps (as can be seen elsewhere in this discussion) about which approach is better, with very heated emotions. While heated, as someone who remembers the BIND-vs-djbdns and Sendmail-vs-Qmail flame wars of 20 years ago all too well, [2] I find an old fashioned to-DJB-or-not-to-DJB heated discussion a refreshing blast from the past compared to the kind of toxic cancel culture, doxxing, and hideous political attacks dominating social media here in the 2020s.

For a brand new project using crypto, I would use libsodium, which is firmly in the DJB school of thought w.r.t. cryptographic algorithms used, simply because, while newer, it doesn’t have the security history OpenSSL has had, and looks to be quite secure and actively maintained by multiple developers.

[1] https://csrc.nist.gov/News/2017/Transition-Plans-for-Key-Est...

[2] I used Qmail and djbdns during that era before transitioning to my own MaraDNS. I only made MaraDNS because of the concerns about djbdns’ then-not-open-source license and the lack of open source diversity w.r.t. DNS servers back then. All moot, now that Qmail/djbdns are public domain, both NSD/Unbound and KnotDNS have come out, and Postfix has become the most common non-Sendmail email daemon. Actually, w.r.t. email, I use an online service here in the 2020s (mxroute), because the headaches of bypassing spam filters are now such it’s best left to someone devoted to just that one problem.

[3] My preferences tend to win the NIST standardization workshops. I liked Rijndael the most during the AES process because it, unlike the others, has a variable block size, so can be used for stuff like AEShash-256. I liked Keccak the most during the SHA-3 process because it was the only unbroken extensible output function (XOF) which ran decently on 32-bit hardware; the only other unbroken XOF, Skein, was pretty much 64-bit only, and BLAKE didn’t have an XOF mode at the time. I still wish BLAKE had a sponge-style “rehash to generate more random bits” XOF mode to make fast forwarding its XOF computationally infeasible.



They probably chose it because it's way more efficient?


Blake is one of the faster SHA3 finalists out there and the ‘s’ variety is fast on 32-bit hardware. I don’t know why they chose Blake2 instead of Blake3, though.


Not to nitpick but BLAKE was the SHA3 finalist, not BLAKE2 which is an improvement.

To quote Wikipedia:

>BLAKE2 removes addition of constants to message words from BLAKE round function, changes two rotation constants, simplifies padding, adds parameter block that is XOR'ed with initialization vectors, and reduces the number of rounds from 16 to 12 for BLAKE2b (successor of BLAKE-512), and from 14 to 10 for BLAKE2s (successor of BLAKE-256).

https://en.wikipedia.org/wiki/BLAKE_(hash_function)#BLAKE2

Reducing the rounds by 4 might raise eyebrows wrt security, but Aumasson's 'Too Much Crypto' discusses BLAKE vs BLAKE2 among other things: https://eprint.iacr.org/2019/1492.pdf

Why kernel devs didn't go with BLAKE3, I can't say.


Blake3 is still quite new (although it is a tested core) so I would say they made the right call.

It would only result in small incremental perf improvement due to reduced number of rounds (10 to 7).

Majority of Blake3's perf benefit manifests from merkle-tree structure and SIMD processing of multiple input streams at the same time.


> Why they didn't go with BLAKE3, I can't say.

Maybe people just liked it better that way?


BLAKE3 is written in Rust wile BLAKE2s is written in C. That is probably a lot easier to push through.

It also seems to be almost as fast according to the BLAKE3 specs PDF (page 11).

https://github.com/BLAKE3-team/BLAKE3-specs/blob/master/blak...


Giantrobot was going for a They Might Be Giants song reference


TMBG was a cover; It was The Four Lads originally I believe


I picked up on that reference too. I was about to type it up when I saw your comment. I think it went over a lot of heads.


Well, the singers might be giants after all.


BLAKE3 has a C implementation including many SIMD implementations. The C implementation only lacks multi-threading which BLAKE2 does not have.


It's BLAKE2, though BLAKE3 was hopeful,

but it just BLAKE2, BLAKE3 got a nope

SHA1's a long time gone, that crypto's a slowpoke

Why did BLAKE3 and SHA1 get the works?

That's nobody's business, even Turks!


claps :D


Don't. Think. Posts. Below. Got. Joke.


> Not to nitpick but BLAKE was the SHA3 finalist, not BLAKE2 which is an improvement.

To add to this: BLAKE2 is derived from BLAKE only, whereas BLAKE3 is derived from Bao AND BLAKE2.

> Why kernel devs didn't go with BLAKE3, I can't say.

https://news.ycombinator.com/item?id=29748581


Your doubly linked comment (adrian_b’s remark) makes the incorrect assertion that the only reason blake3 is faster than blake2 is parallelism; this is objectively false. Blake3 uses fewer rounds (7x2) of the Chacha primitive than blake2 (10x2).

I think tptacek has it right: this thing is used very infrequently (period 300 seconds) and blake2 was already in the kernel, so they just reused it. It’s adequately fast, even if blake3 would be marginally faster.


I do not think that he claimed that the reason is performance, but regardless, I gave other reasons besides performance, too, including "blake2 was already in the kernel, so they just reused it". I said: "BLAKE2 is already in the kernel (with tests and everything), while BLAKE3 is not (at all)".


Yes, we agree here.


Thanks for the correction.


I'm not familiar with the kernel RNG specifically, but I think the hash function is only used with small inputs and outputs, and it's the stream cipher (ChaCha) that has the more performance-sensitive job of generating lots of output bytes. So the performance differences between BLAKE2s and BLAKE3 probably aren't very important here, and the fact that a BLAKE2s implementation is already in the kernel makes this an easy change.


That makes sense, thanks. I wonder if it might make sense to use a reduced-round Chacha for generation. Aumasson calls for 8 rounds; I think Rust’s CSPRNG uses 12.


Read this comment (or thread) as to why not BLAKE3: https://news.ycombinator.com/item?id=29746938

To sum up: BLAKE3 has no advantages over BLAKE2 when it comes to the Linux kernel[1], plus BLAKE2 is already in the kernel (with tests and everything), while BLAKE3 is not (at all). That, and BLAKE3 is still fairly new.

[1] It is faster than BLAKE2 due to parallelism, which is not suitable for the Linux kernel; you want to restrict computation to a single thread instead of making all the cores busy. FWIW, the C reference implementation is not multi-threaded either.


https://www.blake2.net/ says: "CONSIDER USING BLAKE3, faster than BLAKE2, see https://github.com/BLAKE3-team/BLAKE3"

I'm glad we're on Blake2 and seeing the benefits, but why not go straight to Blake3?

What are the differences and why do they matter in this instance?


The main reason why BLAKE3 is much faster than most other hash algorithms is that it can be computed in parallel on many cores.

Inside the kernel and especially in something running continuously like the RNG, you do not want to make all the cores busy without any reason, so the hash computation should be restricted to 1 thread.

In that case, BLAKE3 does not have any important advantages.

When restricted to a single-thread, BLAKE3 is slower than hash algorithms that have hardware support on the CPU, e.g. SHA-256 on most recent CPUs. (But the Linux kernel must support many older CPUs without hardware for hash algorithms, so BLAKE2 is a better choice).


Blake3 is faster because it allows for very efficient SIMD implementations. That alone blows all competition out of the water, with a single thread. The ability to further parallelize it on multiple cores is just a cherry on top.


True, but the use of SIMD is also restricted in the kernel, so that would not matter for the Linux kernel.

In comparison with the hardware instructions for SHA-2 or SHA-3, on those CPUs which have them, BLAKE3 is not faster, because those instructions also use the same SIMD registers, processing the same number of bits per operation.

I use every day BLAKE3, for file checksumming. For this application, on my Zen 3 with hardware SHA-2, BLAKE3 greatly outperforms everything else, but only due to multithreading.

On older CPUs, without SIMD SHA instructions, you are right that BLAKE3 can be faster than other algorithms even in single-thread, by exploiting the parallelism of the BLAKE3 algorithm with a SIMD implementation.

For other hashes, SIMD may not accelerate the computation of a single hash, but when you need to compute multiple hashes you can interleave their computations and obtain similar speedups with SIMD instructions.


Isn't explicit hardware support for SHA-3 rather limited? In particular, there's none on Intel and only A13 and A14 on Apple. It can still be vectorized to a degree on other CPUs, but in that case it'll be slower than Blake3.


For now, only a few extremely recent ARM cores have SHA-3 instructions.

On the other hand, support for SHA-256 and for SHA1 (still useful for non-secure applications) is widespread, in almost all 64-bit ARM CPUs, in all AMD Zen and in some of the Intel CPUs, e.g. Apollo Lake, Gemini Lake, Jasper Lake/Elkhart Lake, Alder Lake, Tiger Lake and Ice Lake.


BLAKE3 was published in 2020. I don't think anyone would seriously use a hash function, cipher, or RNG that has been studied for only two years, let alone in the Linux kernel.


Can anyone provide links or references to current techniques and methods for performing cryptanalysis against hash functions and prngs? I frequently find myself trapped in a loop thinking a long the lines of “they’re just bits there has to be a pattern”. Border line obsessing on these things, blake2b in particular. Does anyone else have this problem?


Mostly similar to block ciphers or permutations. Differential cryptanalysis is a good starting point. Usually one starts with a reduced-round version, often a single round. Break one round, then see if that can be extended to 2, etc, etc.

The cryptanalysis of Meow Hash[1] is instructive, since it's a weak hash and actually vulnerable in full.

There's also a good Differential propagation analysis of Keccak[1] paper, using a reduced-round Keccak.

There's also Rotational cryptanalysis. Here's a paper on Keccak uses that technique.[3] It was first formally defined by Khovratovich and Nikolić.[4]

[1] https://peter.website/meow-hash-cryptanalysis

[2] https://eprint.iacr.org/2012/163.pdf

[3] https://eprint.iacr.org/2012/546.pdf

[4] https://www.iacr.org/archive/fse2010/61470339/61470339.pdf


If encrypted data has a pattern, then it is leaking data. There are some famous examples of this with block cyphers.

If the encrypted data is indistinguishable from white noise, then the CSPRNG you built on top of it can also be made indistinguishable.


No comments, just downvotes. The ECB Penguin is an example of information - patterns - leaking out of encrypted data. This is a cryptographic failure mode and the solution is a better algorithm.

https://blog.filippo.io/the-ecb-penguin/

Low entropy means predictability in the input versus the output. Properly encrypted data has high entropy, which is why you can’t compress it after encryption. Random number generators need to be high entropy, which is where the idea of CSPRNGs comes from.


Start looking at code in libsodium. Crypto is too hard for me and I just trust the libs - but I can read the code (sorta). The stuff in OpenSSL is crazy to me but libsodium I can almost understand. I feel like I can know it but not KNOW it.


Dumb question: why would a RNG need a crypto hash function? I get why it would be useful for a PRNG, but isn’t a true RNG simply supposed to gather entropy from the real world and just spit it back?


No. The high-level design of most CSPRNGs is roughly that of a stream cipher. With an ordinary stream cipher like AES-CTR, we need a 128 bit key; if we had a password or a DH shared secret, we'd get the 128 bit key from a KDF, whose core would be a hash function. With a CSPRNG, in lieu of a key we have an (often indeterminate) stream of unpredictable events --- but we still have the equivalent of a KDF, a "mixing function", whose core is typically a hash function. Adding new "entropy" means feeding it through the mixing function. Getting random bytes means running the stream cipher and spitting out the keystream directly, instead of XORing it against a plaintext.

Cryptographically speaking, the problem of taking a small secret and extracting a very large number of cryptographically unpredictable bytes is pretty fundamental to the whole project --- you know, of, like, cryptography. But for the fact that you'd like to be able to theoretically recover from a bug that discloses your random state (your "key"), you wouldn't really need more than a "seed"'s worth of entropy to run the CSPRNG indefinitely.


By the way, thanks Thomas... I've always enjoyed and benefitted from your "CSPRNGs are essentially a stream cipher, and now you finally also understand what a stream cipher really is" explanations on HN over the years. I remember the first time I grokked it from one of your comments, and it was much clearer than anything I could find on Cryptography Stack Exchange (yes, I know!).

Since then, I've used the AES-CTR trick for non-critical random generators for many user space fuzz test harnesses, and it's helped speed them up while also being stronger than a PRNG, without a repeating period.


I just shoplifted it from Bernstein.


/dev/random hasn't been a "true" RNG in the sense you're describing (an "entropy estimating" RNG I think) since Linux 5.6. However, even an entropy estimating RNG doesn't want to give raw bytes from entropy sources straight to callers, for a few reasons:

1. There might be more than one source. In that case, it's natural to want the output to be secure as long as any of the inputs are secure. For that, you need some way to securely mix two sources of randomness together, and hash functions are good at that.

2. Truly random numbers coming from lava lamps or thermometers or whatever aren't necessarily uniformly random. They might be, say, 1/3 zeros and 2/3 ones. Who knows. That's still useful randomness, but callers want uniformly random bytes, so again we reach for some sort of pseudorandom mixer like a hash function. In this case I think this step is called "whitening".


Even back when /dev/random estimated entropy, it was still a CSPRNG. At that time, /dev/random and /dev/urandom used the same cryptographic function to generate pseudo-random bits, random just added some additional fluff based on the claim that it was “estimating” randomness.


It uses it to expose part of the randomness without exposing internal state and mixing in new random events into the randomness pool.

In essence hash functions are frequently used as Pseudo Random Functions as hash functions are PRFs with input compression.


The universe does not provide bits of pure, sublimated entropy. There's a little entropy in the low bits of the interrupt timings, but you don't know exactly where it is. You need to "stir" it to get the entropy to affect the whole pool.


When you make rng good and fast you get to use it everywhere, which ends up making things more secure and robust <3


Wait, it uses the same algorithm regardless of hardware??


The actual DRNG algorithm Linux kernel uses is ChaCha20, which runs in constant time regardless of underlying hardware. BLAKE2 is below ChaCha20: it reseeds the ChaCha20 DRNG from a 4096-byte input_pool every 300 seconds.


OK, but many have SHA in hardware.


What would happen if your hardware implemented a ROT13 "engine" as its primary crypto thingie?

A responsible approach to crypto should be quite shy. It might drop it's drawers and bare all (source code) but no more. It might show some ankle towards some beefy hardware but only when escorted.

A software crypto thingie cannot and should not assume that hardware is trustworthy. However, you can stir some randomness from an untrustworthy source into your entropy pool and not find your knickers around your ankles, if you are careful.

Feel free to present your credentials.


RDRAND on AMD Ryzen 9 5900X is flakey

https://news.ycombinator.com/item?id=25733285


Why is it not SHA2 which would be faster still with hardware support?


Better question, why not use whatever hash is fastest on the system in question? We're talking about a RNG here -- it's not like we need to make sure that two different systems produce the same random numbers!


My answer to that is "maintainability". The more flexibility and moving parts we add to the RNG, the more things that can break. The delta from "one hash function" to "two hash functions" would involve not just adding a second hash function, but also a bunch of configuration code, target-specific logic, fallback handling, etc. There are plenty of places in the kernel that need to care about this kind of logic, but I don't believe the RNG is one of them, and I don't particularly want to require the RNG maintainers to spend their time caring about it.

Additionally, it's not just the hash function that would matter for speed here, but also the expansion. Linux RNG uses ChaCha20 for that, so if you were going all-in on target-specific speed, you'd need additional logic for swapping that out for a hardware-accelerated cipher (probably AES, which would introduce even more considerations given that it has a 16-byte block size, vs ChaCha20's 64-byte blocks).


The title mentions performance, but it is not the primary motivation AFAICT. It is only mentioned to say “it is not slower”.

The main concern was security, so it makes sense to use BLAKE2, which benefits from existing cryptanalysis of the ChaCha20 permutation, which is already used in the RNG for number generation.

(And it makes sense to use BLAKE2s in particular, to support non-64-bit systems without penalty.)

Using a single hash (instead of picking one at runtime) simplifies the attack surface IMO.


The argument here is that many (most?) Linux systems have access to a hardware SHA2, which is equally secure (in this setting) but faster. Attacks on SHA2 or Blake2 (really, of SHA1, for that matter) aren't how viable real-world attackers on the LKRNG are going to happen. It's good to see SHA1 getting swept away for other reasons though!


Do most Intel (and by extension, Linux) machines have SHA2, though? I think it’s a pretty recent extension and at least initially, they were only shipping it in their low-end embedded models.


Most Intel processors don't have the extension, it is only Goldmont (low power) and 10th gen and up.

All AMD Zen processors have the extension.


See, what do I know. Definitely not "most".


Are you sure? I haven’t been able to buy a CPU that doesn’t have SHA2 acceleration for an number of years now.


> I haven’t been able to buy a CPU that doesn’t have SHA2 acceleration for an number of years now.

This is incorrect. Intel only launched their 11th gen desktop processors March 30, 2021. The 10th gen and earlier desktop processors do not have the SHA instructions. You can still buy a new i9-10900k from Newegg today.

(Note that 10th gen Intel mobile/laptop processors are a different micro-architecture, and do support SHA.)

Edit: Perhaps you're thinking of the AES instructions? They've been around a lot longer.


I think I am. Thanks for the correction.


The argument that not all CPUs running Linux have hardware SHA2 is valid, and therefore can't be assumed. However saying because one (even a significant one... though it's arguable it hasn't been the majority for a while) doesn't and therefore it shouldn't be used seems shortsighted at best. For decades various minority features have been enabled in the Linux kernel. Since when is lowest common denominator the desirable target to utilize exclusively?


I am not making the argument you're rebutting. I was only responding to this claim:

> The argument here is that many (most?) Linux systems have access to a hardware SHA2


Maybe this is a silly question, but why should RNG even be part of the kernel in the first place? It's convenient having it in a device file, but why couldn't that be provided by some userspace program or daemon?


Two good reasons:

1. The kernel has access to unpredictable events that make good key fodder for the CSPRNG itself, which would be more annoying and less efficient to percolate into userland.

2. The kernel can assure all the processes on the device that they're getting random bits from a known source of unpredictable bits, and refuse to deliver those bits if its state hasn't been sufficiently populated; this has been a recurring source of systems vulnerabilities in programs with userland CSPRNGs.

To that, add the convenience factor of always knowing how to get random bits and not having to set up userland dependencies and ensure they're started in the right order &c.

You should generally use your kernel's CSPRNG in preference to other ones, and break that rule of thumb only if you know exactly what you're doing.


1. The kernel has much more access to sources of indeterminism than a userspace application does. Things like disk seeks, packet jitter, asynchronous interrupts, etc. provide lots of "true" entropy to work with. Userspace programs, on the other hand, have very deterministic execution. In fact, the only way to introduce true indeterminism into a userspace program is to query a kernel-mediated resource (e.g. system call, shared memory mapping, etc.), or to invoke an unprivileged and unpredictable hardware instruction, of which there are very few (e.g. RDRAND on x64, LL/SC on ARM).

2. Userspace programs cannot be made as robust to accidental or malicious failures. Even if you have a userspace RNG daemon that randomly open files or sockets to extract entropy, what happens if that daemon crashes? Or it fails to open a file or socket? Or an exploit gets an RCE into the daemon to read privileged files? By contrast, the kernel is already performing all these operations for userspace processes, so it might as well measure those things and stick the results into its own entropy pool to hand out to other processes on request.


The kernel needs randomness too. If it's done there right once, why duplicate the effort in userspace (and watch all the hilarious ways userspace solutions fail)?

getrandom was motivated by the fact that using device files has many failure modes.


Oh, nice, a good 3rd security reason! A userland CSPRNG is essentially just an additional single point of failure, since you're going to end up depending on secure randomness in the kernel already.

It's a good question!


Not arguing, but genuinely curious - what do kernels need randomness for?


A lot of bits of security rely on some level of non-determinism. Things like TCP initial sequence number generation, where every TCP connection sequence starts with a random number. There have been numerous attacks where the RNG wasn't good enough, so an attacker could determine the TCP sequencing and perform various malicious activities. Additionally, things like in-kernel VPNs, like IPSEC and WireGuard, also need RNGs for their own internal functions. Calling out to userspace for that would be painful and could potentially break in a lot of unexpected ways.


I don't recall exactly but i think TCP retransmission delay and handshake adds random to the backoff, to avoid a thundering herd situation which repeats again if all clients retry at same time.

Assigning free ports to applications that listen on a socket is also random, not sure why, feels like it could be sequential unless you want to deliberately obscure what ports are being used.


Address Space Layout Randomisation (ASLR) for one. ASLR moves around key parts of an executable to make exploitation of certain types of vulnerabilities more difficult or impossible.


The kernel can keep secrets from user space, which is necessary for maintaining a secure RNG state.

The kernel also has the hardware access that is used as entropy sources. If the RNG was in user space the kernel would have to provide some way of securely exporting that entropy to user space. It is simpler and more secure to just export the end result of random numbers through a simple API.

All modern OS have made the same decision of having a kernel-based CSRNG, for the same reasons.


Not a silly question at all. It's because having a reliable and uncompromised source of randomness is essential for cryptographic applications. Having the RNG in user space would make it more vulnerable to attack.


I normally care about reproducible rng results and do try to seed all rngs used. There are lots of applications where randomness is used but you still want repeatable behavior. ML experiments is one situation where randomness is common, but is also intended to be repeatable.

I think python standard library/several of data science library rngs are platform independent.


This is not that kind of RNG; this is the system secure random number generator. Secure random numbers aren't seeded.


I wouldn't say the lack of a seed is the difference between a CSPRNG and a PRNG. That's more the difference between a CSPRNG and a stream cipher, where the "seed" is called a "key" and "IV".

PRNGs (Pseudo Random Number Generators) are predictable. CSPRNGs (Cryptographically Secure Pseudo Random Number Generators) aren't (if they're working). HWRNGs (Hardware Random Number Generators) that aren't debiased via a CSPRNG or similar produce nonuniform output (not suitable for cryptography or most other uses directly). TRNGs (True Random Number Generators) might not exist in this universe (deterministic interpretations of quantum mechanics are consistent with observation), it's safer to assume they don't and avoid the term entirely.


I think you are talking about a psuedo-rng.


Basically every RNG you interact with on general purpose computers is a pseudo random number generator. The kernel RNG being discussed here is a CSPRNG, as was the one it replaced.


AFAICT, BLAKE2s (previously SHA-1) is only being used for the forward secrecy element, in this case mixing a hash of the pool back into the core state, which is actually still using ChaCha20 for expansion. From quick inspection (never read this code before) I think the number of bytes in the pool is 416 (104 * 4). (See poolwords defined at https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/lin...) For such relatively small messages and based on the cycles-per-byte performance numbers from https://w3.lasca.ic.unicamp.br/media/publications/p9-faz-her... (SHA-NI benchmarks), https://www.blake2.net/blake2_20130129.pdf (BLAKE2 paper), and https://bench.cr.yp.to/results-hash.html (comprehensive table of measurements), I don't see any performance reasons for choosing BLAKE2s over SHA-256. Rather, software SHA-256 and BLAKE2s seem comparable (and that's being charitable to BLAKE2s), and SHA-NI is definitely faster.

Perhaps there were other considerations at play. Maybe something as simple as the author's preference. One thing that probably wasn't a consideration is FIPS compliance--the core function is ChaCha20 so FIPS kernels require a completely different CSPRNG, anyhow.


https://github.com/BLAKE3-team/BLAKE3/blob/master/media/spee... argues BLAKE2s is twice as fast compared to SHA256.

One aspect switching from SHA1 to BLAKE2s does is it increases the total entropy a single compression operation adds to ChaCha20. This increases speed when folded BLAKE2s adds 128 bits per operation instead of folded SHA-1 that adds 80 bits. So that's two calls instead of four (I'm assuming they kept the folding). Another speedup comes from the fact the hash function constants aren't being filled with RDRAND inputs for every call.

Finally, I'm not completely sure if increasing the hash size itself adds computational security against an attack where the internal state is compromised once, and the attacker tries to brute force the new state based on new output; My conjecture is the reseeding operation is atomic, i.e. that ChaCha20 won't yield anything until the reseed is complete. There shouldn't thus be any difference in this regard. I'd appreciate clarification wrt this.


> argues BLAKE2s is twice as fast compared to SHA256.

That's for 16KiB inputs.

> One aspect switching from SHA1 to BLAKE2s does is it increases the total entropy a single compression operation adds to ChaCha20. This increases speed when folded BLAKE2s adds 128 bits per operation instead of folded SHA-1 that adds 80 bits.

But the question was why BLAKE2s instead of SHA-256, not SHA-1. SHA-256 has the same digest length as BLAKE2s.


> That's for 16KiB inputs.

BLAKE3 needs 16 KiB of input to hit the numbers in that bar chart, but BLAKE2s doesn't. It'll maintain its advantage over SHA-256 all the way down to the empty string. You can see this in Figure 3 of https://github.com/BLAKE3-team/BLAKE3-specs/blob/master/blak.... (BLAKE3 is also faster than SHA-256 all the way down to the empty string, but not by as large a margin as the 16 KiB measurements suggest.)

On the other hand, these measurements were done on machines without SHA-256 hardware acceleration. If you have that (and Intel chips from the past year do), then SHA-256 does a lot better of course.


Ah, the low eBASH numbers must be for SHA-NI. Looking at older CPUs SHA-256 ends up being about 2x slower than BLAKE2s at 576-byte message sizes.


>That's for 16KiB inputs.

Oh that's a good point.

>But the question was why BLAKE2s instead of SHA-256, not SHA-1. SHA-256 has the same digest length as BLAKE2s.

Two things come to mind. Firstly, does it really matter to speed? The reseeding interval of ChaCha20 DRNG (i.e. BLAKE2 call frequency) is 300 seconds and it runs in the order of milliseconds. Best bang for buck at this point would result from ChaCha-NI.

Secondly, there's the aspect of reducing reliance on an algorithm that suffers from length extension attacks. While LRNG itself doesn't directly benefit from BLAKE2s's indifferentiability, it helps in phasing out SHA-2 which is less misuse-resistant, and that might be misused elsewhere.

(Finally, no more pointless flame wars about "An algorithm created by the NSA is being used in the LRNG!!")


> The reseeding interval of ChaCha20 DRNG (i.e. BLAKE2 call frequency) is 300 seconds and it runs in the order of milliseconds.

I would guess microseconds.


Because Hardware support is not available in alot oft devices and then blake2 beats sha2.


I wonder if there are more virtual machines than bare metal hosts. There must be, right?

Even with paravirtualization as an option I feel like many users of VMs would opt for the advantages of full virtualization.


My vague understanding was that KVM-accelerated virtualization still let you use Virtio RNG, although I guess I don't know what the relative performance looks like at that point.


Virtio-rng is good for getting quality seed material into the guest, but you don’t want to use it for bulk generation. Some hypervisor environments impose arbitrary and low throttles on guest virtio-rng use (think kB/s).


Speed was probably not the criteria otherwise blake3 seems like an even better choice:

https://github.com/BLAKE3-team/BLAKE3


I think speed is definitely a big part of this, though the speedup comes primarily from getting rid of superfluous calls to RDRAND. Blake2s is already in the kernel (after a fashion) for WireGuard itself; I don't think Blake3 is. An additional nerdy point here is that the extraction phase of the LKRNG is already using ChaCha (there's a sense in which a CSPRNG is really just the keystream of a stream cipher), and Blake2s and ChaCha are closely related.


So should we expect a blake3 switch sometime in the future? It seems to be a refinement of blake2 to make it more amenable to optimization while keeping most (all?) its qualities. Being well suited for optimization across architectures would also make it ideal for the kernel and it seems the reference implementation has already done a lot of the heavy lifting.


I doubt it, but who knows?


Blake3 will only result in small incremental perf improvement due to reduced number of rounds (10 to 7).

Majority of Blake3's perf benefit manifests from merkle-tree structure and SIMD processing of multiple input streams at the same time.


> multiple input streams at the same time

Just in case this isn't clear, BLAKE3 breaks a single large input up into many chunks, and it hashes those chunks in parallel. The caller doesn't need to provide mulitple separate inputs to take advantage of the SIMD optimizations. (If you do have multiple separate inputs, you can actually use similar SIMD optimizations with almost any hash function. But because this situation is rare, libraries that provide this sort of API are also rare. Here's one of mine: https://docs.rs/blake2s_simd/1.0.0/blake2s_simd/many/index.h....)


[flagged]


Jason is very well known, already has code in the kernel and is touching a _highly_ visible part of the code.

That's a pretty bold statement on your part, especially because there are many other, simpler, mechanisms that can be used for subverting/back dooring without such costs




Consider applying for YC's Spring batch! Applications are open till Feb 11.

Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: