One reason this cannot be translated into a portable C version while retaining its compactness (and readability) is the lack of support for carry/borrow chains.
In Go, bits.Add64 takes as input and returns the carry.
In C, there are some constructs that modern compilers will recognize as the patterns people use to extract the carry, but it is hit-and-miss.
Even compiler-specific intrinsics are few and far between, and some are inefficient or even broken.
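For illustration, here is a rough sketch of what such an intrinsic-based carry chain might look like. This assumes GCC or Clang; __builtin_add_overflow is a compiler extension, not standard C, and the helper name is my own:

```c
#include <stdint.h>

/* Sketch of a bits.Add64-style helper using the GCC/Clang
 * __builtin_add_overflow extension; portable ISO C has no way
 * to express the carry chain directly. */
static uint64_t add64_builtin(uint64_t x, uint64_t y, uint64_t carry_in,
                              uint64_t *carry_out)
{
    uint64_t sum;
    unsigned c1 = __builtin_add_overflow(x, y, &sum);
    unsigned c2 = __builtin_add_overflow(sum, carry_in, &sum);
    *carry_out = c1 | c2;   /* at most one of the two can overflow */
    return sum;
}
```

Whether this actually compiles down to an add/adc pair depends on the compiler and optimization level.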
If you look into the implementations of Add64/Sub64/Mul64 you'll see that it's not a lot of code and that there is nothing magical about them. They can just as easily be implemented in C in some 30 lines of code. Because they concern unsigned integers only there is no UB to consider on overflow/carry.
Actually, those functions are intrinsics in the Go compiler on many platforms, so the compiler generates a single ADD / ADC opcode. The source code that you see is a fallback for backends that have not implemented them as intrinsics yet.
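As a rough sketch of that point, here is a Sub64-style borrow chain in portable C. This is my own translation of the idea, not Go's actual fallback code; unsigned wraparound is well-defined, so there is no UB:

```c
#include <stdint.h>

/* Portable-C sketch of a Sub64-style borrow chain (illustrative,
 * not Go's actual fallback implementation). Unsigned arithmetic
 * wraps mod 2^64 by definition, so overflow here is not UB. */
static uint64_t sub64_c(uint64_t x, uint64_t y, uint64_t borrow_in,
                        uint64_t *borrow_out)
{
    uint64_t yb   = y + borrow_in;  /* wraps to 0 only if y == UINT64_MAX and borrow_in == 1 */
    uint64_t diff = x - yb;         /* wraps correctly mod 2^64 */
    *borrow_out = (yb < y) || (x < yb);
    return diff;
}
```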
Edit: OK, I'm having a much harder time getting "adc" instructions generated, and only gcc sees the opportunity to emit "setc" where appropriate. I see the issue now.
That's true, but I'm specifically addressing the compactness and readability concern.
With respect to efficiency and particularly timing, Go and C seem to be in the same boat here: depending on your compiler and platform it may compile either to a constant-time chain of add/adc-equivalent instructions or to something sub-optimal and branchy.
Not intuitively, but it's generally good to be wary of optimizing compilers in such cases. I don't know what the Go compiler outputs, but a straightforward translation to C didn't compile to a constant-time operation with (recent) gcc -O3. For all I know it might with some other compiler. This didn't result in any branches in gcc, though:
static uint64_t
add64(uint64_t x, uint64_t y, uint64_t *carry)
{
    uint64_t yc, sum;

    yc = y + *carry;
    sum = x + yc;
    *carry = (sum < x || yc < y);
    return sum;
}
But again, some other compiler on some other platform may have another idea of what to output.
Anyway, I agree one should be careful. On platforms that have to use the fallback implementations, if those aren't optimized to constant-time operations, this will lose one of the important properties of Poly1305.
Great write-up! I haven't dug into understanding the whole code yet, but perhaps someone can answer this for me: could the code be made even simpler using Go's bignum type, big.Int?
One concern may be the lack of a constant-time guarantee. If there are any optimizations that run differently depending on the numbers involved, then that may be an opening to a side-channel attack.
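To illustrate the kind of property at stake, here is a sketch of a branchless 64-bit equality test, the style of primitive that constant-time code is built from (and that big.Int makes no promises about). The function name is my own:

```c
#include <stdint.h>

/* Branchless 64-bit equality test: returns 1 if a == b, else 0,
 * with no data-dependent branches. Illustrative sketch only;
 * a compiler could in principle still transform it. */
static uint64_t ct_eq64(uint64_t a, uint64_t b)
{
    uint64_t x = a ^ b;                /* zero iff a == b */
    return 1 ^ ((x | (0 - x)) >> 63);  /* top bit of x | -x is set iff x != 0 */
}
```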
Upvoted, but such optimizations might exist for integer math, too. I’m not aware of any compiler that gives constant-time guarantees.
The best you can do is to write statements that you think map to certain instructions, and to check every release build to see what instructions the compiler chose.
You also may have to make sure that the buffers you use don’t cross page boundaries.
Unit tests that throw random numbers at the code and check that execution time is always a constant number of clock cycles should catch nearly all timing side channels.
That won't work. When we say "constant-time code" what we really mean is "code whose running time is independent of its input". It won't really be constant due to factors like the OS scheduler and the initial state of the CPU's instruction cache.
As dfranke said, it isn’t trivial to filter out the noise from other code running in parallel with yours. Counting clock cycles instead of wall-clock time isn’t enough either, as some attacks use wall-clock time.
And it’s worse: the fact that your code is constant-time in the absence of an adversary doesn’t guarantee that it is so in the presence of one. You will have to stress-test your code while code running on other CPU threads flushes various caches, switches CPU modes, does DMA, etc.
Having said that, running with random inputs isn’t a bad idea, if only because you have to get confidence that the CPU manufacturer’s timing data is correct.
Possibly, but I would assume it would be too slow. While big does have assembly implementations, it is designed to handle very large numbers -- ~int128-sized numbers are probably too small for it to be efficient. I could be wrong, though.
I find a lot of Go libraries are very well commented and very accessible, but the standard library, and especially the crypto code, is very sparsely commented.