Several things:

- The fprem1 instruction is actually a long microcode sequence and is quite slow: 26-50 cycles on Sandy Bridge (SNB), according to Agner Fog. It also computes only a partial remainder, so for some operands several iterations of the fprem1 loop are necessary to complete the reduction.

- There is no analogous instruction on ARM (or really, on any platform that isn't x86) anyway.

- If you're using the floating-point remainder operation in a performance-sensitive context, You're Doing It Wrong. Programmers are so used to avoiding it in hot code that there is little value in optimizing remainder; it is rarely used in situations where the optimization would matter.