If backdoored constants are easy enough to find that you are able to find them AND they are appropriate English phrases then the cipher itself is likely broken doubly so as the origional version of chacha had 2 key sizes 16-byte and 32-byte and each used the applicable constant, so in this case you would have had to have found 2 matching back doored constants.
That being said you aren't crazy this is what the nsa was accused of doing with elliptical curves though they had started with inexplicable random seeds.
It doesn't really matter. The key thing is that the constant bytes prevent a symmetry from forming in the block. So it matters that the string does aid such symmetry. Making it a text string is just a flourish helpful to us humans. It makes it easier to see the lack of symmetry and trivially answers the question, "why those bytes?"
That's a good cite but really mostly relevant to curves. If a cipher design had any of this kind of flexibility with regards to its parameters or inputs, nobody would use that cipher.
No, it does not. You're taking this notion of constrained values out of context. ASCII strings in crypto constructions are very common; for instance, they provide domain isolation in hash constructions (where you have a single hash function applied to inputs of different sensitivity, and want to mint multiple logically unrelated hash functions from the one you have). They're also common in versions.
The ASCII we're looking at here is conceptually a hash input. It's not a part of the design of the hash core itself.
This theory probably contradicts the usage of block cipher in counter mode. With them, the input block is usually full of zeros. This constraint has strengthened ciphers in the past. DJB wrote somewhere that some block cipher were broken in CBC mode (all the bock can be controlled by the attacker), while CTR mode was still safe (many bits forced to zero).
If you enjoyed the style of this article, I would also recommend you have a look at the brilliant explanation of Earley parsing by the same author: http://loup-vaillant.fr/tutorials/earley-parsing/
How does ChaCha20 compare to the established AES standard? Is it stronger? weaker? faster? slower? easier to implement correctly? harder to implement correctly? better for some other reason? worse for some other reason?
* As an ARX design, doesn't need S-boxes, and so doesn't leave a cache footprint
* Has free key setup
AES is:
* A global standard
* Available in hardware on most platforms (extremely important)
* A conventional block cipher for which a bunch of modes (in particular: wide-block and AEAD) are already defined
But unlike Salsa, AES:
* Has relatively complicated key schedule (you have to expand its key input to a series of per-round keys, which imposes a cost when you switch keys)
* Relies on S-boxes for security and so must carefully avoid microarchitectural side channels
* Is much harder to implement
* Is not a native stream cipher, so requires an adapter (usually: GCM mode) to use safely.
AES is usually faster on modern systems because it's implemented directly in silicon. Salsa is usually the fastest pure-software option. Both are so fast that the speed difference is not particularly important, but most systems will prefer AES when hardware support is present.
Salsa is almost certainly the better choice for new designs just because of its simplicity. It's harder to screw up Salsa20 or its derivatives than it is to screw up AES (it is very easy to screw up AES), and its performance is more than satisfactory.
Even then, they actually used a tweaked version of ChaCha20 that uses a 96-bit nonce (just barely large enough to be suitable for randomly-generated nonces) and a 32-bit counter (limiting its use to 128GiB for a given nonce). Also, an extension XChaCha20 was recently published which performs an extra 20 rounds to initialize the cipher state, allowing for 192-bit nonces with no corresponding reduction in counter size.
Yes, but that's true of all sorts of things that aren't really global standards. Don't get me wrong: you should use Salsa ciphers. I'm just trying to provide the most honest possible accounting.
AES is hard to implement on a general purpose computer in a way that is both fast and doesn't leak through cache timing attacks.
The safe way to use AES is by using a hardware implementation, like modern x86 and some ARM CPUs.
The best software implementations use bitslicing and SSE, but are still slow. The best I saw is an Emilia Kasper and Peter Schwabe paper[1] from 2009 on bitsliced AES-GCM has 21.99 cycles/byte performance for constant-time implementation authenticated AES-GCM.
For comparison, Intel shows[2] 0.77 cycles/byte for same with a hardware implementation, albeit on a newer CPU.
Chacha is fast on modern general purpose CPUs without the need for a hardware implementation of chacha. One reason it's fast is that it was designed so that a normal compiler can generate machine code from regular-looking C code in such a way that it uses vector (wide) registers and uses independent operations to use as many operations in the CPU in parallel in the same clock cycle, without requiring an assembly wizard to do that. Intel can afford assembly wizards (i.e. Shay Gueron), other people can't.
Modern TLS stacks prefer AES when running on a CPU that has AES hardware and fallback to chacha otherwise. They of course fallback to either a slow or an insecure implementation of AES if the other side doesn't support chacha.
Basic Chacha C implementations do not get auto-vectorized down to ultra-efficient code. The most efficient implementations are intrinsic/assembler that process 4 (SSE2/AVX/NEON) or 8 (AVX2) Chacha20 blocks at once. This is due to layout of variables and operations being designed for efficient SIMD use and the blocks being independent of each other. (Shay's Chacha20 implementation is also not the fastest!)
Basic GNU C implementations don't get auto-vectorised full stop. But with a little bit of effort Chacha20 can be made to vectorise. The implementation in here is vectorised by GNU C:
If "ultra-efficient code" means what could be produced by a programmer highly skilled in some amd64 implementation (intel core2, amd bulldozer, ...) for that implementation then yes I doubt GNU C produces it. But the odds are GNU C's output runs faster than that's guru's code on other amd64 implementations.
That Salsa implementation is not being vectorized? Salsa also requires some values to be shuffled around to actually work in SSE registers, djb made a bit of a boo-boo when designing it. Chacha fixes that, so its SIMD implementations are a bit more straightforward.
If you don't have HW-acceleration, for example AES-NI instructions in x86-64, ChaCha will normally be quite a lot faster. Esp on 32-bit and 64-bit architectures.
Being a stream cipher you can also precompute the keystream. This reduces encrypt/decrypt to a simple XOR when handling the message - depending on message length of course. And yes, AES-CTR can also be used like this.
Chacha20 can do random access. See the end of my article, when I talk about counter mode. To get the part of the stream you want, you just generate the block you need (they're all the same, only the counter changes), then encrypt it. No need to generate all previous blocks.
Indeed, one reason for using AES in counter mode is this random access, which among other things enables parallel encryption. The same strategy works with Chacha20.
I'm probably stating the obvious here, but whatever your strategy for decrypting is you still must verify the ciphertext integrity, which unfortunately for you is calculated on the whole ciphertext. You may win some time by not reading the stuff before the block you're interested, but you will have to read the whole stuff anyway if you want to be safe.
I'm no expert of course so I don't even know if there's an AEAD that can bring you integrity on parts of the input; at least I know that minilock (https://github.com/kaepora/miniLock/blob/master/README.md#-m...) builds some kind of counter mode where each chunk is properly encrypted and has everything needed to check its integrity.
The most widespread way of using Salsa/ChaCha is in the "Chapoly" construction, which combines ChaCha20 with DJB's Poly1305 polynomial MAC; this is an authenticated construction. Pretty much every mainstream application of Salsa20 is in fact a Salsa/Poly1305 construction.
You can also just combine Salsa and HMAC.
It's true that you need to authenticate your data, but this is true for any cipher that you use.
It's a bad idea to implement your own cipher code, no matter what you're doing. If you're looking to include Salsa/ChaCha in an application, use Nacl, which refuses to give you unauthenticated ciphertext.
There's virtually no difference in utility, since pretty much the only thing we ever do with a block cipher is adapt it to encrypt streams --- this is true conceptually even when we're not literally turning the block cipher into a PRF with something like CTR mode.
This is solid but might miss the forest for the trees.
If you want to understand Chacha/Salsa, the best way to start is that it's an ARX-basedhash function, keyed, running in counter mode.
ARX stands for addition, rotation, and XOR, which are the three operations ARX designs are composed of. Addition is nonlinear in the context of ARX, which eliminates the need for S-boxes (or complex alternatives to S-boxes) and makes it easier to build constant time crypto. You can make any function out of just A, R, and X (technically, any function out of A and R, but less efficiently). A good place to start understanding why you want rotation and nonlinearity is the Wikipedia page for SP Networks:
Another good bit of background is the old idea of iterated ciphers (round functions run repeatedly, rather than one giant cipher function), and the slightly more modern idea that it's better to have a very simple round function you repeat a lot than a complicated round function you run fewer times. When you get this you can start grokking design decisions in terms of how many rounds they shave off your design to achieve the same security (which also might get you back to why you have X in addition to A and R).
The Salsa20 core is a very simple hash function designed to be fast and flexible for multiple constructions. Bernstein designed the stream cipher we all know based on it, and also Rumba20, which is a more tradition collision-resistant cryptographic hash. Designing ciphers out of hash functions has been a research interest of Bernstein's since the 1990s, when hash functions were approved for export but ciphers not.
A keyed hash (also: PRF) is a hash function that takes a secret key input. In a general-purpose hash like SHA3, you provide the key along with the rest of the input by simple concatenation. Fun true fact: in SHA-2 and hashes before that, it was unsafe to do this, which is why we have the HMAC construction, which SHA-3 and Blake2 obsolete. At any rate, Salsa20 takes the key as a special parameter and encodes it into a block.
Counter mode is conventionally a method for turning a block cipher into a stream cipher. In 2017 its widely seen as the most important and primary way you should use block ciphers (if you're not using an AEAD, most of which are built on counter mode in some way). Counter mode is super simple: you encrypt a counter of some sort and XOR the resulting block with your plaintext. To decrypt, you do the same thing.
A really good place to start learning about Salsa20 (and thus Chacha20) is Bernstein's design paper, which is extremely readable and easy to skim:
> A good place to start understanding why you want rotation and nonlinearity is the Wikipedia page for SP Networks:
I guess the real story requires knowing a little bit about linear and differential cryptanalysis, which are conceptually quite simple in their genesis, from a mathematical perspective.
XOR and n-bit addition are both forms of addition over different finite fields, GF(2) and GF(2^n). Multiplication in GF(2) is AND, so any linear function on a vector space over GF(2) is some kind of "masked parity" function, with functions only distinguished by their mask.
You can back-solve for inputs given enough independent outputs using Gaussian elimination and other standard linear algebra algorithms. Linear cryptanalysis is based on finding combinations of output bits that behave close enough to linear as a function of input bits to make this kind of strategy yield usable information. That is, just as in computational mathematics more generally, we approximate non-linear functions by linear functions and apply linear algebra techniques to the linear functions.
Differential cryptanalysis is the same general idea but with GF(2^n) as the scalar field instead of GF(2). If f is a linear function then f(x + y) = f(x) + f(y) and f(a x) = a f(x), so it's likewise true that f(x - y) = f(x) - f(y) by taking a = -1. That is, reading this last equation backwards, if f is truly linear then for any pair of vectors x and y with the same difference x - y, we should expect f(x) - f(y) to have the same exact value. If f is an encryption function (assume the key is baked into it), then all we have is f(x) and f(y), so we can't compute f(x - y) without knowing x and y, but we can certainly try to feed lots of plaintext pairs x and y with the same difference and see how the differences f(x) - f(y) of their ciphertexts relate to each other. If we can find a large family of plaintext pairs that have nearly the same difference in ciphertexts (by an appropriate measure of "nearly"), then this reveals an approximate linearity in the encryption function, and at that point we're back to being able to use linear algebra techniques to gain information about f and hence the key baked into f.
ARX attempts to foil such techniques by mixing both XOR and addition, which would individually create linear functions over their respective fields, but in combination help a little bit to break up the linearity over both finite fields. And the R in ARX is bitwise rotation, which is actually linear over GF(2) vector spaces (it's just a permutation of the vector's entries) but strongly nonlinear over GF(2^n) vector spaces.
Are there any concrete, simple examples showing linear and differential cryptanalysis (simple, breakable cipher + example cracking program)? As much as I've studied the theory and perused the design decisions of modern ciphers to avoid such attacks, I've never taken the time to sit down and actually crack a simple cipher using them. Would be neat to do so.
The starter exercise labeled 6.2 is a good way to get your feet wet with the ideas I described. 12-round DES without any S-boxes consists of P-boxes (permutations) and XORs, which are both linear over GF(2) vector spaces, so it's a linear block cipher and hence trivially breakable with any linear algebra package. RC5 without rotations is not exactly linear over either GF(2^n) or GF(2) since it mixes XORs and (mod 2^n) additions, but the combination is only very weakly nonlinear (there's not enough avalanching from the carries to entangle entries that are far apart), and therefore a good demonstration of why you need rotations in ARX to introduce rapid long-range bit entanglement. And in case it wasn't already obvious, the exercise about RC5 with rotations by a round number will show you why the rotation amount in ARX should be relatively prime to the bit width. Otherwise you end up with disconnected rotation orbits where the round function only mixes within a given orbit. In the extreme case where the rotation amount is half the bit width, each orbit contains at most two elements, so it's hardly any better than no rotation at all.
I bet there are also modern textbooks in cryptanalysis with exercises and a more hand-holding approach. Maybe any cryptographers reading this could recommend something.
I should also note that the picture I painted is most applicable to block ciphers, but it does apply mutatis mutandis to hash functions and pseudo-random generators.
With hashing, you have an inherent loss of information and hence the linear functions in question are non-invertible. Linear hash functions can be expressed as a factorization into two parts, an invertible function followed by a projection onto the first n bits (for a hash function producing n bits of output). With hash functions you're typically trying to find collisions or first and second preimages. Both of these are simple for linear functions. Let me just paint the picture for collisions. If f is linear then its kernel ker(f) = {x | f(x) = 0} can be efficiently calculated by Gaussian elimination. Then you can crank out collisions like no-one's business: if k is in ker(f) then f(x + k) = f(x) + f(k) = f(x). In practice, you're not going to find perfectly linear hash functions in the wild, but if you can detect an approximate linearity on some subspace, you can calculate the kernel of the linear approximation and use that to generate perturbations (the k from earlier) for a randomized collision search with a much higher likelihood of success per perturbation than random chance.
For pseudo-random generators, you're usually trying to solve for the PRG's internal state from a sequence of outputs. The generator function for a PRG takes its current internal state and produces the new internal state and an output. Incidentally, this is a nice completion of the triangle of cryptographic primitives. With block ciphers, you had invertible (injective and surjective) functions. With hash functions, you had non-injective functions (fewer output bits than input bits). With pseudo-random generators, you now have non-surjective functions (more output bits than input bits). While we cannot see the private state output from the generator, we do have multiple examples of the public output. So if (x_n, s_n) = g(s_(n-1)) is the generator equation, then in an ideal case we have the sequence of equations (x_1, s_1) = g(s_0), (x_2, s_2) = g(s_1), ..., (x_n, s_n) = g(s_(n-1)), where x_1, ..., x_n are known to us. If we assume g is a linear function that is known to us as well (no security through obscurity), then this is just a system of linear equations. For a generator with a maximal period, if we have as many bits of output as there are bits of internal state, we may solve uniquely for the initial state s_0, and from there we can calculate s_1, s_2, etc, by just replaying g starting with s_0. As in our previous examples, things in the wild aren't perfectly linear, but it's enough for PRGs to be approximately linear to leak bits of internal state, though we usually need far more example bits of output than there are bits of internal state to make a dent.
If you're interested in some reading about ARX systems, a good paper is Rotational Cryptanalysis of ARX[1]. It's quite readable. If you follow the citations of that paper [2], you can find some interesting stuff.
Thank you. Adding a quick reference to ARX designs right away.
Edit: added these paragraphs:
Quick summary: Chacha20 is ARX-based hash function, keyed, running in counter mode. It embodies the idea that one can use a hash function to encrypt data.
The expert will immediately notice this quarter-round is an ARX based design (ARX stands for, Addition, Rotation, Xor) which despite its simplicity can be made as good as a regular permutation-substitution network.
The author seems to favor XChaCha20 over ChaCha20, even though XChaCha20 is not part of any formal standard or any widely know paper. [1] It would be interesting to know what DJB (the author of ChaCha20) thinks about XChaCha20 and related variants.
I took some liberty here, but this is a straightforward derivation of XSalsa20. I trust XChacha20 just as strongly as I trust Chacha20.
The same cannot be said about my implementation however. I haven't found test vectors for XChacha20, so for now I have to rely on code review (only my own eyes so far) to ascertain its correctness. Not ideal.
Also I don't favour XChacha20 over Chacha20, because it is slightly slower to initialise. If a 64-bit nonce is enough, I'll use Chacha20. Maybe I should make this clear in my article.
ugh, an extra ietf variant that pads the remaining 64-bit
from the nonce to fit 96 bit thats incompatible with all
the implementations out there... :-(
a += b; d ^= a; d <<<= 16;
c += d; b ^= c; b <<<= 12;
a += b; d ^= a; d <<<= 8;
c += d; b ^= c; b <<<= 7;
I'm curious: could a C compiler look at the quarter-round function above and determine that a+b (or the other terms) might overflow a 32-bit integer, and thus invoke undefined behaviour to eliminate the loop entirely?
No, unconditionally safe. The C standard exactly defines unsigned overflow while specifically leaving signed overflow undefined.
(Pedantically timing is always a crapshoot in C, a compiler only need produce the same results as the abstract machine. It could freely take all your secret data and modulate it into the timing and be conforming. -- but considering that intel/amd won't make timing promises about the instructions themselves...)
for (int i = 0; i < 32; i++) // safe branch
diff_bits |= x[i] ^ y[i];
then no, it doesn't leak, because the result of the resulting conditional branch doesn't depend on a secret. The only reason NaCl unrolls that loop is because neeed moar speeed.
If you were talking about the early return straightforward for loop:
for (int i = 0; i < 32; i++) // safe branch
if (x[i] != y[i]) // timing leak...
return -1; // ...magnified
Then yeah, it leaks.
> Longer term worry is the optimizer will figure the above out as well.
It may, but even compiler implementers realise the value of constant time code. Replacing this code with an early return doesn't just require very sophisticated optimisations, it never happens outside of a crypto library. There is little incentive for compiler writers to do this.
Problem is the optimizer is totally free to implement the 'safe' code snippet using an 'unsafe' early return. According to the standard that would be completely legal.
A compiler that did this optimization would immediately introduce a pragma guaranteed to result in constant time memcmp from some blessed source pattern.
Maintainers of crypto libraries inspect the assembly when upgrading their compilers, test with many compilers and document the versions of compilers they support.
Even if it were undefined behaviour, it would still only be conditionally undefined. A compiler can only eliminate a code path that unconditionally results in undefined behaviour.
Even if the numbers were signed (unsigned overflow is well-defined), being able to eliminate the loop would require knowing that overflow MUST occur, not that it MAY occur.
I am not sure this matters?
I mean, facebook managed to get a reasonably nice .onion routing id (facebookcorewwwi.onion) by bruteforcing stuff right?
I can imagine bruteforcing the "backdoor key space" to find something that looks good, am I insane?