Related: I recently published a preprint describing some techniques to speed up evaluation of transcendental functions in the range of a few hundred bits to a few thousand bits: http://arxiv.org/abs/1410.7176
The easiest way to make elementary functions faster is to use a different libm: the glibc preoccupation with being correctly rounded (ie accurate to within 0.5 ulps) imposes significant performance penalties. For example the glibc log function, which is often the dominant cost in a lot of sampling algorithms, is about 2.5x slower than a straightforward implementation (ie no fancy SIMD tricks) of Tang's 25 year old table lookup algorithm, which is accurate to 0.57 ulps. Recent OS X libms are faster still, and appear to be accurate to about 0.51 ulps, though unfortunately are closed source.
I remember being amazed in the final year of college, the PowerMac 7600 I'd spent my student loan on, was running my FORTRAN as quick as the DEC Alphas in the lab.
There's an x87 asm implementation for x86-32 in sysdeps/i386/fpu/e_exp.S [1]. For x86-64, it falls back to the generic C code for exp(), in sysdeps/ieee754/dbl-64/e_exp.c. The sad part is that there is x86-64 asm for a few variants (exp10l, exp2l, expf, expl), but since these are not C90, people tend to avoid them for more portable versions.
So, ironically, it can be orders of magnitude faster to compute a long-double expl() than a double-precision exp().
[1] Even the asm contains a lot of extra code to handle edge cases that Intel screwed up, like returning NaN for exp(+/- Inf).
The 387 instructions for elementary functions do not have the reputation of having a good accuracy/time ratio compared to, say, state-of-the-art implementations using polynomial approximations.
Do you have actual measurements that reveal some actual irony, or is it only superficial irony in that expl() is computed using the (microcoded, slow) hardware instruction and exp() isn't?
Note: For most functions, you can actually make a pretty accurate double-precision implementation by calling the corresponding 387 instruction and then rounding the result to double-precision. Not a correctly rounded implementation (that would require a wider intermediate result than 80-bit) but pretty good nonetheless.
We didn't measure average case performance, but for worst-case performance the "orders of magnitude" comment is absolutely backed by measurement. See also my comment at https://news.ycombinator.com/item?id=8830667
In our case, we only needed about 12 bits of accuracy anyway, so we wound up writing our own implementation using polynomial approximations (which have the benefit of predictable running times regardless of the quality of your C library). The fact that you need to do this in any context where worst-case performance is a concern might be surprising to some people. It was to me before I read glibc's code.
I take it back, we did measure average case performance of expf() (though not expl()), and it was slower than exp():
16:12:48 < gmaxwell> ha. So, completely eliminating all the doubles made celt slower.
16:13:23 < gmaxwell> I'm guessing that this is because I used the C99 float versions of the math functions and they suck.
16:13:57 < gmaxwell> looks like 3% slower or so.
16:14:05 < gmaxwell> <3 GLIBC.
16:26:02 < gmaxwell> with float approx the no-doubles is .9% faster. Whoohoo. ;)
17:24:06 < derf> Yes. Don't use the float versions of libc functions, at least not on x86-64.
17:24:24 < Dark_Shikari> what the heck did they do ?
17:24:53 < derf> They reset the rounding modes on entrance and exit of expf(), for example, regardless of whether or not they're already correctly set up.
17:25:07 < derf> Apparently this is really slow.
17:25:32 < gmaxwell> derf: I was hoping that they'd be fixed by now.
"Float approx" refers to our custom polynomial approximation of exp(). Given that, I think it's safe to infer that expl() would also have been slower than exp() in the average case. That was in 2010, though. This appears to have been fixed for expf() in 2012 (which now uses SSE/AVX instead of x87 asm on x86-64), but not for expl().
> 17:24:53 < derf> They reset the rounding modes on entrance and exit of expf(), for example, regardless of whether or not they're already correctly set up.
I fixed this around the same time I worked on the mp improvements:
Set/restore rounding mode only when needed
The most common use case of math functions is with default rounding
mode, i.e. rounding to nearest. Setting and restoring rounding mode
is an unnecessary overhead for this, so I've added support for a
context, which does the set/restore only if the FP status needs a
change. The code is written such that only x86 uses these. Other
architectures should be unaffected by it, but would definitely benefit
if the set/restore has as much overhead relative to the rest of the
code, as the x86 bits do.
What I meant is that the assembly instruction is faster only because it does not provide the same level of accuracy as the generic implementation.
Everyone's needs are different, but it does seem that it is a little bit early, and perhaps that it will always be too early, for a general-purpose libm to aim for correctly rounded transcendental function. Even the best implementations only work in one rounding mode, leading to the pleasantries discussed in a sibling of this comment, and the worst-case execution time can be a problem for some.
It is possible to aim for 0.51 ULP with implementations that still work(-ish) if the rounding mode is not round-to-nearest, and perhaps this is the sweet spot that libms that want to be useful should aim for.
I believe on x86/x64 CPUs there's an advantage to doing these operations via SSE rather than using the old x87 opcodes. For instance, you don't need to move data into the x87 registers.
Implementations in CRLibM are correctly rounded (i.e. as accurate as possible) and are, in my benchmarks, faster than GNU LibM. Thus, they are a Pareto improvement over GNU LibM.
The crlibm code is terrible to read and hence maintain. I had considered including crlibm too during my work but decided against it because it would be a lot more work.
Also, crlibm functions are provably correctly rounded only for univariate transcendentals (thanks to the worst case paper I cited in the blog). There is no such proof available for bivariate functions, which is why computing at an arbitrary very large precision (the way glibc does) is the only way to increase the likelihood of correctly rounded results. IIRC crlibm does not even try for bivariates.
I believe some documentation on the crlibm website cites benchmark numbers where IIRC glibc performs better (although not by much) in the fast paths of most functions they mention but gets hit really badly in the slow path. Slow path inputs are usually quite rare, but if your application/benchmark hits them regularly enough (especially sice the slow path is a factor of 1000 times worse) then you'll obviously conclude that the performance sucks and that crlibm is better.
Sorry, I don't buy the argument that "code is terrible to read and hence maintain" from a GLibC developer. CRLibM does have a bunch or tricky macros to work-around bugs in decade-old compilers for architectures no one cares about, but GNU LibC is not any better in this aspect.
CRLibM does not have a provably correctly-rounded pow, but univariate functions can be merged.
The paper you cited is about decimal floating-point implementation, it is not relevant to the LibM functions discussed in the article.
In many contexts, the problem is not the absolute performance, but the variation in performance. For example, we want the Opus codec to be relatively insensitive to timing analysis, since Opus audio is often encrypted and transmitted over a network. Large variations represent an information leak that could be exploited by an external observer.
When we started looking at actual timings, we found that it was mostly pretty consistent, except for huge slowdowns in some frames. The cause? glibc's exp() slow path.
We have had this in our todo list for a while. The leading idea currently is to provide compiler selectable function variants that don't fall into the multiple precision abyss. The biggest challenge here is to come up with data about accuracy of results without the mp path.
No wonder people don't like doing floating-point calculations if theres a chance with certain input to have your pow-call fallthrough into a 1000x slower code path.
I mean, that should probably be an option. It's realloc all over, plain unusable because of the rare chance the function goes mad and copies data instead of failing.
Do you have a reference for the claim that realloc() is unusable for the reason you say?
Many users of floating-point are not using floating-point in hard real-time contexts. Even soft real-time contexts can be fine with Ziv-style successive approximations. Correctly rounded library functions have intangible benefits beyond precision (portability, for instance), but if you don't value these benefits, you can always use inaccurate and always fast functions. You don't even need to invent your own: just rip off the first stage of a multi-stage Ziv implementation.
I don't think any of this is the reason why “people don't like doing floating-point calculations”.