One reason this cannot be translated into a portable C version while retaining its compactness (and readability) is the lack of support for carry/borrow chains.
In Go, bits.Add64 takes as input and returns the carry.
In C, there are some constructs that modern compilers will recognize as the patterns people use to extract the carry, but it is hit-and-miss.
Even compiler-specific intrinsics are few and far between, and some are inefficient or even broken.
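For illustration, here is a rough sketch of what such an intrinsic-based carry chain might look like. This assumes GCC or Clang; __builtin_add_overflow is a compiler extension, not standard C, and the helper name is my own:

```c
#include <stdint.h>

/* Sketch of a bits.Add64-style helper using the GCC/Clang
 * __builtin_add_overflow extension; portable ISO C has no way
 * to express the carry chain directly. */
static uint64_t add64_builtin(uint64_t x, uint64_t y, uint64_t carry_in,
                              uint64_t *carry_out)
{
    uint64_t sum;
    unsigned c1 = __builtin_add_overflow(x, y, &sum);
    unsigned c2 = __builtin_add_overflow(sum, carry_in, &sum);
    *carry_out = c1 | c2;   /* at most one of the two can overflow */
    return sum;
}
```

Whether this actually compiles down to an add/adc pair depends on the compiler and optimization level.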
If you look into the implementations of Add64/Sub64/Mul64 you'll see that it's not a lot of code and that there is nothing magical about them. They can just as easily be implemented in C in some 30 lines of code. Because they concern unsigned integers only there is no UB to consider on overflow/carry.
Actually, those functions are intrinsics in the Go compiler on many platforms, so the compiler generates a single ADD / ADC opcode. The source code that you see is a fallback for backends that have not implemented them as intrinsics yet.
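As a rough sketch of that point, here is a Sub64-style borrow chain in portable C. This is my own translation of the idea, not Go's actual fallback code; unsigned wraparound is well-defined, so there is no UB:

```c
#include <stdint.h>

/* Portable-C sketch of a Sub64-style borrow chain (illustrative,
 * not Go's actual fallback implementation). Unsigned arithmetic
 * wraps mod 2^64 by definition, so overflow here is not UB. */
static uint64_t sub64_c(uint64_t x, uint64_t y, uint64_t borrow_in,
                        uint64_t *borrow_out)
{
    uint64_t yb   = y + borrow_in;  /* wraps to 0 only if y == UINT64_MAX and borrow_in == 1 */
    uint64_t diff = x - yb;         /* wraps correctly mod 2^64 */
    *borrow_out = (yb < y) || (x < yb);
    return diff;
}
```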
Edit: OK, I'm having a much harder time getting "adc" instructions generated, and only gcc sees the opportunity to emit "setc" where appropriate. I see the issue now.
That's true, but I'm specifically addressing the compactness and readability concern.
With respect to efficiency and particularly timing, Go and C seem to be in the same boat here: depending on your compiler and platform it may compile either to a constant-time chain of add/adc-equivalent instructions or to something sub-optimal and branchy.
Not intuitively, but it's generally good to be wary of optimizing compilers in such cases. I don't know what the Go compiler outputs, but a straightforward translation to C didn't compile to a constant-time operation with (recent) gcc -O3. For all I know it might with some other compiler. This didn't result in any branches in gcc, though:
static uint64_t
add64(uint64_t x, uint64_t y, uint64_t *carry)
{
    uint64_t yc, sum;

    yc = y + *carry;
    sum = x + yc;
    *carry = (sum < x || yc < y);
    return sum;
}
But again, some other compiler on some other platform may have another idea of what to output.
Anyway, I agree one should be careful. On platforms that have to use the fallback implementations, if those aren't optimized to constant-time operations, this will lose one of the important properties of Poly1305.
Great write-up! I haven't dug into understanding the whole code yet, but perhaps someone can answer this for me: could the code be made even simpler using Go's bignum type, big.Int?
One concern may be the lack of a constant-time guarantee. If there are any optimizations that run differently depending on the numbers involved, then that may be an opening to a side-channel attack.
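To illustrate the kind of property at stake, here is a sketch of a branchless 64-bit equality test, the style of primitive that constant-time code is built from (and that big.Int makes no promises about). The function name is my own:

```c
#include <stdint.h>

/* Branchless 64-bit equality test: returns 1 if a == b, else 0,
 * with no data-dependent branches. Illustrative sketch only;
 * a compiler could in principle still transform it. */
static uint64_t ct_eq64(uint64_t a, uint64_t b)
{
    uint64_t x = a ^ b;                /* zero iff a == b */
    return 1 ^ ((x | (0 - x)) >> 63);  /* top bit of x | -x is set iff x != 0 */
}
```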
Upvoted, but such optimizations might exist for integer math, too. I’m not aware of any compiler that gives constant-time guarantees.
The best you can do is to write statements that you think map to certain instructions, and to check every release build to see what instructions the compiler chose.
You also may have to make sure that the buffers you use don’t cross page boundaries.
Unit tests that throw random numbers at the code and check that execution time is always a constant number of clock cycles should catch nearly all timing side channels.
That won't work. When we say "constant-time code" what we really mean is "code whose running time is independent of its input". It won't really be constant due to factors like the OS scheduler and the initial state of the CPU's instruction cache.
As dfranke said, it isn’t trivial to filter out the noise from other code running in parallel with yours. Counting clock cycles instead of wall-clock time isn’t enough either, as some attacks use wall-clock time.
And it’s worse: the fact that your code is constant-time in the absence of an adversary doesn’t guarantee that it is so in the presence of one. You will have to stress-test your code while code running on other CPU threads flushes various caches, switches CPU modes, does DMA, etc.
Having said that, running with random inputs isn’t a bad idea, if only because you have to get confidence that the CPU manufacturer’s timing data is correct.
Possibly, but I would assume it would be too slow. While big does have assembly implementations, it is designed to handle very large numbers -- ~int128-sized numbers are probably too small for it to be efficient. I could be wrong, though.
I find a lot of Go libraries are very well commented and very accessible, but the standard library, and especially the crypto code, is very sparsely commented.