> What's crazy about this is that we're limited to multiplying integers
> of about 25 bits [...]
> Why? Because the multiplier insists on rounding
> its low output bits rather than giving them back to the programmer.
> Why? Because the multiplier is buried inside the floating-point unit.
The reference to the 25-bit success makes it seem like the author is thinking of bits 53-105 of the 106-bit result of multiplying two 53-bit inputs.
The multiplier is not going to “give them back” to the programmer because it never computed them in the first place. Floating-point 53x53->53 multiplication is implemented with a couple of guard bits, the last of which is “sticky”. The multiplier the author is asking for is not “buried”; it simply does not exist in the processor.
I am not sure why integer multiplication is not preferred. It goes up to 64x64->128, if the author wants all the bits of the product.
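For reference, a minimal sketch of that full-width integer product from C, using the `unsigned __int128` extension available in GCC and Clang on 64-bit targets (on x86-64 this compiles down to a single widening MUL/MULX); the struct and function names are just illustrative:

```c
/* Sketch: full 64x64->128 product via the unsigned __int128 extension
   supported by GCC and Clang on 64-bit targets. */
#include <stdint.h>

typedef struct { uint64_t hi, lo; } u128_parts;

static u128_parts mul_full(uint64_t a, uint64_t b) {
    unsigned __int128 p = (unsigned __int128)a * b;   /* all 128 bits of the product */
    u128_parts r = { (uint64_t)(p >> 64), (uint64_t)p };
    return r;
}
```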
> A 53-bit multiplier producing a full 106-bit product won't be much larger than a 53-bit multiplier producing a correctly rounded 53-bit floating-point result.
Yes, it will. It will have to be 106-bit wide instead of 56-bit wide.
> I am not sure why integer multiplication is not preferred. It goes up to 64x64->128, if the author wants all the bits of the product.
I'm a little confused by this also, but given the level of detail in the linked paper (http://eprint.iacr.org/2014/134.pdf) it's definitely not just an oversight. The issue may be that he's aiming for fastest possible performance on Sandy Bridge (AVX), and 256-bit integer vector multiplication wasn't added until Haswell (AVX2). And even then, 64-bit integer elements can only be done one at a time. But since he's suggesting instruction set changes, he's clearly not thinking only of backwards compatibility.
I think the design of his Curve25519 elliptic curve assumes 25-bit integers implemented using floating point arithmetic and requires them for maximum performance. I certainly haven't seen any fast integer-only implementations of it.
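A rough illustration of why such small limbs matter (a sketch only, not the limb layout djb's code actually uses): products of integers up to about 25 bits, and modest sums of such products, stay below 2^53 and are therefore represented exactly in a double, so the FPU's rounding never triggers.

```c
/* Sketch: limbs small enough that double-precision arithmetic stays exact.
   Two 25-bit limbs multiply to at most ~50 bits; a handful of such products
   summed together stays below 2^53, the width of a double's significand,
   so no rounding ever occurs. Illustrative values only. */
#include <assert.h>
#include <stdint.h>

int main(void) {
    double a = 33554431.0;            /* 2^25 - 1: a 25-bit limb */
    double b = 33554431.0;
    double p = a * b;                 /* ~50-bit product, exact in a double */
    assert((uint64_t)p == 33554431ULL * 33554431ULL);

    double acc = p + p + p + p;       /* a few accumulated products, still < 2^53 */
    assert((uint64_t)acc == 4ULL * (33554431ULL * 33554431ULL));
    return 0;
}
```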
How do you correctly round, say, 6172293634027511 * 7059478094414279 / 2^53 = 4837593847918366.50000000000000011102 without essentially computing the full 106-bit product?
The idea is that you can use guard digits instead of full products and get correct rounding (as if to infinite precision). For IEEE-754 math you need 3 guard digits, typically named Guard, Round, and Sticky. The trick at the hardware level is that Sticky represents the OR of a bunch of bits. Instead of keeping the full 106-bit partial product at each step you only need to keep the 53+3-bit product. As you shift+add the partial products, you use special rules to keep track of the Sticky bit and you will still get the right answer.
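For concreteness, here is a toy model of the round-to-nearest-even decision those three bits drive. Note that this sketch extracts Guard/Round/Sticky from the full 106-bit product, which is exactly the step whose necessity is debated below; the hardware trick described above is to maintain Sticky as a running OR instead of ever holding all of the low bits at once.

```c
/* Toy model of round-to-nearest-even driven by Guard/Round/Sticky bits.
   For clarity the bits are extracted from the full double-width product. */
#include <stdio.h>
#include <stdint.h>

/* a and b are normalized 53-bit significands, i.e. in [2^52, 2^53). */
static uint64_t round_mul_rne(uint64_t a, uint64_t b, int *exp_of_ulp) {
    unsigned __int128 p = (unsigned __int128)a * b;            /* 105 or 106 bits */
    int width = (p >> 105) ? 106 : 105;
    int shift = width - 53;                                    /* low bits to drop */

    uint64_t kept   = (uint64_t)(p >> shift);                  /* top 53 bits */
    uint64_t guard  = (uint64_t)(p >> (shift - 1)) & 1;
    uint64_t round_ = (uint64_t)(p >> (shift - 2)) & 1;
    uint64_t sticky = (p & ((((unsigned __int128)1) << (shift - 2)) - 1)) != 0;

    if (guard && (round_ | sticky | (kept & 1)))               /* round to nearest, ties to even */
        kept += 1;
    if (kept >> 53) { kept >>= 1; shift += 1; }                /* rounding carried out of the top */
    *exp_of_ulp = shift;                                       /* result = kept * 2^shift */
    return kept;
}

int main(void) {
    int e;
    uint64_t r = round_mul_rne(6172293634027511ULL, 7059478094414279ULL, &e);
    printf("%llu * 2^%d\n", (unsigned long long)r, e);         /* 4837593847918367 * 2^53 */
    return 0;
}
```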
I'm aware of the 53+3 bit algorithm for addition and subtraction, but I don't see how this could possibly work for multiplication. Could you please point me to a paper that explains how to compute a correctly rounded 53-bit product without computing the full 106-bit product?
Using a sticky bit means computing the logical-or of all the bits that are sufficiently far to the right in all these partial results, instead of computing them individually and summing them.
Like for addition, the idea is that the result you obtain is different from the exact result, but rounds the same. The sticky bit only serves to round (midpoint + a small quantity) correctly up.
Perhaps a better way to describe it is that during the additions, the accumulator only ever needs to be shifted right in order to be aligned with the next number to add, never left. So the accumulator can be 53+3-bit wide and carry at each step the 53+3-bit representation of the partial sum already computed, and this is enough to produce the correctly rounded result after all numbers have been added.
This explanation assumes that the N partial products are added sequentially. I don't know what techniques are used to achieve the 4- or 5-cycle latencies of modern desktop processors. Implementations with more parallelism and/or that are part of an FMA unit have it harder, but I still expect that they don't naively compute a full-width partial result before rounding it (in the case of an FMA, that would be impractical, as the partial result could be thousands of bits wide).
Looks like you're actually computing the full product. The only optimization you're doing is forgetting the low order bits as soon as you're done with them. The author is asking to keep them, which sounds reasonable. You initially claimed they were "never computed in the first place", and we're asking you to back up that claim if you still think it's true.
I see what you mean now. I agree that each of the low-order bits of the 106-bit product is available at some point. Not necessarily simultaneously, but yes, it seems like little additional effort for processor designers to keep these somewhere and assemble them in another register or something.
For some reason (maximum depth reached?) there is no “reply” link at the bottom of the post of yours where I understood your point, so I am replying here.
My example has nothing to do with decimal rounding. It's an example of a product (6172293634027511 * 7059478094414279) whose result is extremely close to the fence between two representable numbers (4837593847918366.5 * 2^53). If the 106-bit product happened to be 2 units smaller, then the rounded result would change.
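One way to check this is to compare the exact 106-bit product against that fence directly, for instance with the `__int128` extension (a sketch, assuming a GCC/Clang 64-bit target):

```c
/* Sketch: compare the exact 106-bit product against the rounding fence
   4837593847918366.5 * 2^53 = (2*4837593847918366 + 1) * 2^52. */
#include <stdio.h>
#include <stdint.h>

int main(void) {
    unsigned __int128 product = (unsigned __int128)6172293634027511ULL
                              * 7059478094414279ULL;
    unsigned __int128 fence   = (unsigned __int128)9675187695836733ULL << 52;

    /* The product lands just above the fence, so it rounds up to
       4837593847918367 * 2^53; a couple of units lower and it would
       round down instead. */
    printf("product - fence = %lld\n", (long long)(product - fence));
    return 0;
}
```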
At best, Intel at one point provided 80-bit rounding (i.e., a 64-bit mantissa). That is, if you use the x87 floating-point coprocessor. But these technically do not match IEEE specifications for rounding.
Er, what? Floating point numbers do "round correctly" (in binary) -- that's pretty much the whole point of the IEEE standard.
In particular, to compute a product of two 53-bit floating-point numbers with correct rounding (as the standard mandates), it is certainly not sufficient in general to compute just a 54-bit approximation and round that.
> The worst case is that today you promise secrecy for, e.g., the MUL inputs, and then realize in several years that you really want to make a faster variable-time multiplier; the commitment would then force you to add a new MULABORT instruction rather than violating the secrecy of the MUL instruction. Of course, if you don't make any commitments, then programmers have no choice but to rely on observations of CPU behavior; many of those programmers will assume constant-time MUL, and if you switch to variable-time MUL then you will be breaking cryptographic security.
I disagree with this and think he has it backwards; the majority of applications will benefit from early-out multiplication, and it's only crypto that wants constant-time. Also, MUL has never been constant-time on x86, ever since the 8086. My suggestion would be to add a control register bit that puts the CPU into a mode where the guarantees needed for crypto are true. All that software would need to do is set the bit before performing sensitive operations, and clear it afterwards.
I believe that's his point; I read that section not as "please make these instructions constant-time" and more as "please document whether these instructions are constant-time, so that we can rely on them being constant-time". If an instruction is constant time but then a new chipset comes out with early-exit semantics for multiplication, everything previously compiled for that architecture is suddenly vulnerable to timing attacks when it wasn't previously. He's just looking for a commitment for the leak and timing semantics of various instructions, even if the commitment is "you cannot rely upon this being constant-time".
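For illustration, this is the kind of code whose security argument silently depends on such a commitment: a sketch (the helper is hypothetical, not taken from any particular library) of the inner multiply-accumulate step of a bignum multiplication, where the operands are secret and the usual assumption is that the MUL it compiles to takes the same time for all operand values.

```c
/* Sketch: one limb multiply-accumulate step of a schoolbook bignum
   multiplication, with carry handling simplified. The constant-time
   argument for code like this assumes the 64x64->128 multiply below
   takes the same number of cycles regardless of the (secret) operand
   values -- exactly the commitment being discussed. */
#include <stdint.h>

static void mul_acc(uint64_t a, uint64_t b, uint64_t *lo, uint64_t *hi) {
    unsigned __int128 p = (unsigned __int128)a * b;   /* secret-dependent operands */
    uint64_t l = *lo + (uint64_t)p;                   /* add low half into the limb */
    *hi += (uint64_t)(p >> 64) + (l < *lo);           /* propagate the carry */
    *lo  = l;
}
```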
DJB is spot-on for the pipeline documentation. This can help many apps and workflows, ranging from the one he describes (tiny, tightly optimized cryptography code) all the way up to the entire JDK, JavaScript V8, Erlang/BEAM, and similar large VMs.
Except modern architectures don't have "pipelines"... at least not like ARM's pipelines. Modern architectures have decoders that translate the instructions into micro-ops, which are then stored in a cache (beyond the L1 cache) and then executed.
Heck, the entire concept of Intel's "Hyperthreading" kills any pipeline determinism you would hope to achieve.
Consider your typical ARM read-after-write stall: add r1, r2 / add r1, r3. A stall needs to be inserted into the pipeline because the second add depends on the value of r1 produced by the first.
In contrast, the typical out-of-order (OoO) x86 core (Steamroller, Ivy Bridge, and Bay Trail) will look ahead for later instructions to execute during the pipeline stall. In the case of Hyperthreaded CPUs (Intel i7, i3, older Atoms), Intel will grab instructions from _another thread_ to execute while the stall is being resolved (see the sketch just below for the effect of this overlapping).
And worse yet, as Intel and AMD upgrade their systems, sometimes there are massive architectural differences which cause regressions in some code. The infamous AMD Bulldozer (2011 chip) regressed in FPU and vectorized code, for instance, while AMD tried to up the core count and improve integer performance.
If you really want to step into this arena, just use Agner Fog's optimization manual.
But... it isn't as simple as ARM because x86 CPUs are far more complicated. Out-of-order processors just kill any determinism you have.... let alone hyperthreading or the other tricks these bigger CPUs do.
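As a concrete illustration of the overlapping mentioned above (a sketch assuming x86-64 with GCC or Clang, crudely timed with the TSC): the second loop below performs four times as many multiplies as the first, but because the four recurrences are independent, an out-of-order core typically finishes it in far less than four times the time.

```c
/* Sketch: latency-bound vs. throughput-bound work on an out-of-order core.
   One dependent chain is limited by multiply latency; four independent
   chains let the hardware overlap operations around the stalls. */
#include <stdio.h>
#include <stdint.h>
#include <x86intrin.h>

#define N 100000000ULL
#define MUL 6364136223846793005ULL   /* arbitrary LCG-style constants */
#define ADD 1442695040888963407ULL

int main(void) {
    uint64_t a = 1, b = 2, c = 3, d = 4;

    uint64_t t0 = __rdtsc();
    for (uint64_t i = 0; i < N; i++)
        a = a * MUL + ADD;                       /* one chain: every step waits */
    uint64_t t1 = __rdtsc();
    for (uint64_t i = 0; i < N; i++) {
        a = a * MUL + ADD;  b = b * MUL + ADD;   /* four independent chains:   */
        c = c * MUL + ADD;  d = d * MUL + ADD;   /* 4x the work, overlapped    */
    }
    uint64_t t2 = __rdtsc();

    printf("1 chain : %llu cycles\n", (unsigned long long)(t1 - t0));
    printf("4 chains: %llu cycles\n", (unsigned long long)(t2 - t1));
    printf("(ignore) %llu\n", (unsigned long long)(a ^ b ^ c ^ d));  /* keep results live */
    return 0;
}
```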
Out of order processors do not "kill" determinism. If you can guarantee your initial state is the same you will get the same result in the same number of cycles repeatably. The fact that the time it takes is non-obvious does not make it non-deterministic.
Now hyperthreading does introduce real non-determinism from the standpoint of a single-threaded calculation, because it doesn't control the other thread using processor resources, but hyperthreading introduces that sort of non-determinism even on in-order processors.
Well sure, different implementations behave differently. It still has nothing to do with being OoOE...
A 486 and an original Atom are both in-order implementations of the i386 architecture (though both also implement other extensions), but knowing how code runs on one tells me nothing about the other, and if I just have an ISA reference for the i386 without additional documentation or hardware to test on, I won't be able to predict how code I write performs on either of them.
Intel actually provides better documentation on its CPU pipelines than ARM does, especially for anything more advanced than the Cortex-A9. Then with Cortex-A5 and A7 you don't even get instruction timings.
With the Cortex-A15 you get nothing official, even though there are quite a few hidden gotchas that are critical for performance. (Did you know that the NEON pipelines are 64-bit wide? Or that there's only 1 VFP pipeline whereas there are two symmetric NEON pipelines? That's some of the barest of basics, yet nowhere does ARM publicly document this info.)
With Intel you get execution port categories (though you have to infer exactly which instructions fall into which categories), official timings (again inferring precise µop counts), and all sorts of details as to the inner workings of the pipelines (random example: there's an entire section about how store forwarding works in Nehalem in the optimization manual, which is quite beyond the detail you'd see in any official ARM manual).
And that's not counting Agner Fog and the details Intel feeds tech journalists.
I think one of the reasons ARM doesn't really want to provide instruction timings is that it sells IP cores, not hardware. Others who integrate ARM cores may make changes to the design that mean their instruction timings will be different, whereas Intel designs and manufactures the CPUs themselves.
In software terms, that's more difficult than adding support for new HTML extensions to a web browser given only its Java bytecode.
No one makes modifications that could affect instruction timings (which is a very fundamental aspect of a core), and I very much doubt such modifications are even allowed in the license you buy the cores under, even with a separate architectural license (which is for ground-up designs). Even NVIDIA who bragged about their A9 and A15 core revisions only gave feedback to ARM and had to have them actually implement the changes, and none of those changes affected instruction timing.
Different ARMv7 designs have different timings, but I think Cortex A15 is always the same, though running at different frequencies and voltages and consuming different amounts of power.
> How to design an elliptic-curve signature system — There are many choices of elliptic-curve signature systems. The standard choice, ECDSA, is reasonable if you don't care about simplicity, speed, and security.
He goes on to demonstrate his point in minute detail.
At what point does it make sense to have a separate crypto-core / offload-engine?
You could provide all of the isolation and funky instructions that you want without affecting the general purpose CPU. It may also be more efficient as it is a special purpose processor.
Those certainly exist and are used frequently in systems that require very high-performance encryption, but what DJB is trying to do is make encryption capability ubiquitous. A massive first step was taken when AES instructions were added to Intel CPUs, but more could be done. The advantage of having it be ubiquitous in the CPU is that overall cost is reduced, and encryption can be used in more and more places since users aren't stuck waiting.
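To give a sense of what that first step looks like from software, here is a sketch of the AES-128 round sequence using the AES-NI intrinsics; the helper name and the precomputed round_keys array are illustrative (key expansion is omitted), and it assumes a CPU with AES-NI and compilation with -maes.

```c
/* Sketch: encrypting one 16-byte block with AES-NI intrinsics, assuming
   the 11 round keys have already been expanded (key schedule omitted). */
#include <wmmintrin.h>   /* AES-NI intrinsics */

static __m128i aes128_encrypt_block(__m128i block, const __m128i round_keys[11]) {
    block = _mm_xor_si128(block, round_keys[0]);          /* initial AddRoundKey */
    for (int i = 1; i < 10; i++)
        block = _mm_aesenc_si128(block, round_keys[i]);   /* one full AES round each */
    return _mm_aesenclast_si128(block, round_keys[10]);   /* final round (no MixColumns) */
}
```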
I imagine this would only be more expensive in die area and cost, because you'd duplicate existing circuits. For example, one of djb's suggestions is to reuse the existing floating-point unit for crypto: to make small changes, exposing the functionality of the internal, 53-bit integer multiplier that's a component of the 64-bit floating point multiplier. There's already eight of these in each core (with AVX, since 2011), costing a formidable number of transistors. Creating a new crypto unit, either in the CPU or a coprocessor, would duplicate much of them.
As djb emphasizes in the article, if it's too expensive it might not make sense for Intel to implement it.
I doubt they will ever become ubiquitous. For most people CPU crypto will always be good enough and saving a few dollars is more important. Otherwise you would already own one.