Improving math performance in glibc

fdej · on Jan 3, 2015

Related: I recently published a preprint describing some techniques to speed up evaluation of transcendental functions in the range of a few hundred bits to a few thousand bits: http://arxiv.org/abs/1410.7176

simonbyrne · on Jan 3, 2015

The easiest way to make elementary functions faster is to use a different libm: the glibc preoccupation with being correctly rounded (ie accurate to within 0.5 ulps) imposes significant performance penalties. For example the glibc log function, which is often the dominant cost in a lot of sampling algorithms, is about 2.5x slower than a straightforward implementation (ie no fancy SIMD tricks) of Tang's 25 year old table lookup algorithm, which is accurate to 0.57 ulps. Recent OS X libms are faster still, and appear to be accurate to about 0.51 ulps, though unfortunately are closed source.

gaius · on Jan 3, 2015

Back in the good old days, Motorola provided a drop-in highly optimized replacement maths library for Power Macs http://uk.mathworks.com/matlabcentral/newsreader/view_thread...

I remember being amazed in the final year of college, the PowerMac 7600 I'd spent my student loan on, was running my FORTRAN as quick as the DEC Alphas in the lab.

grandinj · on Jan 3, 2015

Maybe this is a place where GLIBC could provide additional variants with improved performance and reduced rounding accuracy.

csense · on Jan 3, 2015

Aren't these implemented in hardware on most CPU's? Or is this floating point emulation for CPU's that don't have built-in floating point?

gsg · on Jan 3, 2015

The hardware implementations often have problems: http://randomascii.wordpress.com/2014/10/09/intel-underestim...

derf_ · on Jan 3, 2015

There's an x87 asm implementation for x86-32 in sysdeps/i386/fpu/e_exp.S [1]. For x86-64, it falls back to the generic C code for exp(), in sysdeps/ieee754/dbl-64/e_exp.c. The sad part is that there is x86-64 asm for a few variants (exp10l, exp2l, expf, expl), but since these are not C90, people tend to avoid them for more portable versions.

So, ironically, it can be orders of magnitude faster to compute a long-double expl() than a double-precision exp().

[1] Even the asm contains a lot of extra code to handle edge cases that Intel screwed up, like returning NaN for exp(+/- Inf).

pascal_cuoq · on Jan 3, 2015

The 387 instructions for elementary functions do not have the reputation of having a good accuracy/time ratio compared to, say, state-of-the-art implementations using polynomial approximations.

Do you have actual measurements that reveal some actual irony, or is it only superficial irony in that expl() is computed using the (microcoded, slow) hardware instruction and exp() isn't?

Note: For most functions, you can actually make a pretty accurate double-precision implementation by calling the corresponding 387 instruction and then rounding the result to double-precision. Not a correctly rounded implementation (that would require a wider intermediate result than 80-bit) but pretty good nonetheless.

derf_ · on Jan 3, 2015

We didn't measure average case performance, but for worst-case performance the "orders of magnitude" comment is absolutely backed by measurement. See also my comment at https://news.ycombinator.com/item?id=8830667

In our case, we only needed about 12 bits of accuracy anyway, so we wound up writing our own implementation using polynomial approximations (which have the benefit of predictable running times regardless of the quality of your C library). The fact that you need to do this in any context where worst-case performance is a concern might be surprising to some people. It was to me before I read glibc's code.

derf_ · on Jan 4, 2015

I take it back, we did measure average case performance of expf() (though not expl()), and it was slower than exp():

16:12:48 < gmaxwell> ha. So, completely eliminating all the doubles made celt slower.

16:13:23 < gmaxwell> I'm guessing that this is because I used the C99 float versions of the math functions and they suck.

16:13:57 < gmaxwell> looks like 3% slower or so.

16:14:05 < gmaxwell> <3 GLIBC.

16:26:02 < gmaxwell> with float approx the no-doubles is .9% faster. Whoohoo. ;)

17:24:06 < derf> Yes. Don't use the float versions of libc functions, at least not on x86-64.

17:24:24 < Dark_Shikari> what the heck did they do ?

17:24:53 < derf> They reset the rounding modes on entrance and exit of expf(), for example, regardless of whether or not they're already correctly set up.

17:25:07 < derf> Apparently this is really slow.

17:25:32 < gmaxwell> derf: I was hoping that they'd be fixed by now.

"Float approx" refers to our custom polynomial approximation of exp(). Given that, I think it's safe to infer that expl() would also have been slower than exp() in the average case. That was in 2010, though. This appears to have been fixed for expf() in 2012 (which now uses SSE/AVX instead of x87 asm on x86-64), but not for expl().

siddhesh · on Jan 4, 2015

> 17:24:53 < derf> They reset the rounding modes on entrance and exit of expf(), for example, regardless of whether or not they're already correctly set up.

I fixed this around the same time I worked on the mp improvements:

commit 2506109403de69bd454de27835d42e6eb6ec3abc Author: Siddhesh Poyarekar <siddhesh@redhat.com> Date: Wed Jun 12 10:36:48 2013 +0530

    Set/restore rounding mode only when needed
    
    The most common use case of math functions is with default rounding
    mode, i.e. rounding to nearest.  Setting and restoring rounding mode
    is an unnecessary overhead for this, so I've added support for a
    context, which does the set/restore only if the FP status needs a
    change.  The code is written such that only x86 uses these.  Other
    architectures should be unaffected by it, but would definitely benefit
    if the set/restore has as much overhead relative to the rest of the
    code, as the x86 bits do.

derf_ · on Jan 5, 2015

That patch doesn't affect the x87 asm in sysdeps/x86_64/fpu/e_expl.S, though, does it? Good to hear it's fixed for some cases, though.

pascal_cuoq · on Jan 5, 2015

What I meant is that the assembly instruction is faster only because it does not provide the same level of accuracy as the generic implementation.

Everyone's needs are different, but it does seem that it is a little bit early, and perhaps that it will always be too early, for a general-purpose libm to aim for correctly rounded transcendental function. Even the best implementations only work in one rounding mode, leading to the pleasantries discussed in a sibling of this comment, and the worst-case execution time can be a problem for some.

It is possible to aim for 0.51 ULP with implementations that still work(-ish) if the rounding mode is not round-to-nearest, and perhaps this is the sweet spot that libms that want to be useful should aim for.

CurtHagenlocher · on Jan 3, 2015

I believe on x86/x64 CPUs there's an advantage to doing these operations via SSE rather than using the old x87 opcodes. For instance, you don't need to move data into the x87 registers.

Narishma · on Jan 3, 2015

Right, you have to move it instead into the SSE registers.

FeepingCreature · on Jan 3, 2015

Where it should be living anyways.

Marat_Dukhan · on Jan 2, 2015

Why not just merge the CRLibM implementations?

GFK_of_xmaspast · on Jan 3, 2015

Are they actual improvements? The project looks like it's been dead almost five years.

(Second guess: ideology)

Marat_Dukhan · on Jan 3, 2015

Implementations in CRLibM are correctly rounded (i.e. as accurate as possible) and are, in my benchmarks, faster than GNU LibM. Thus, they are a Pareto improvement over GNU LibM.

siddhesh · on Jan 3, 2015

The crlibm code is terrible to read and hence maintain. I had considered including crlibm too during my work but decided against it because it would be a lot more work.

Also, crlibm functions are provably correctly rounded only for univariate transcendentals (thanks to the worst case paper I cited in the blog). There is no such proof available for bivariate functions, which is why computing at an arbitrary very large precision (the way glibc does) is the only way to increase the likelihood of correctly rounded results. IIRC crlibm does not even try for bivariates.

I believe some documentation on the crlibm website cites benchmark numbers where IIRC glibc performs better (although not by much) in the fast paths of most functions they mention but gets hit really badly in the slow path. Slow path inputs are usually quite rare, but if your application/benchmark hits them regularly enough (especially sice the slow path is a factor of 1000 times worse) then you'll obviously conclude that the performance sucks and that crlibm is better.

Marat_Dukhan · on Jan 3, 2015

Sorry, I don't buy the argument that "code is terrible to read and hence maintain" from a GLibC developer. CRLibM does have a bunch or tricky macros to work-around bugs in decade-old compilers for architectures no one cares about, but GNU LibC is not any better in this aspect.

CRLibM does not have a provably correctly-rounded pow, but univariate functions can be merged.

The paper you cited is about decimal floating-point implementation, it is not relevant to the LibM functions discussed in the article.

I have a utility for benchmarking LibM libraries here: https://bitbucket.org/MDukhan/hysteria

Some benchmarks including CRLibM and GNU LibM are here: http://www.yeppp.info/benchmarks.html

siddhesh · on Jan 4, 2015

Yes the link is incorrect. I am working on getting that fixed.

derf_ · on Jan 3, 2015

In many contexts, the problem is not the absolute performance, but the variation in performance. For example, we want the Opus codec to be relatively insensitive to timing analysis, since Opus audio is often encrypted and transmitted over a network. Large variations represent an information leak that could be exploited by an external observer.

When we started looking at actual timings, we found that it was mostly pretty consistent, except for huge slowdowns in some frames. The cause? glibc's exp() slow path.

siddhesh · on Jan 4, 2015

We have had this in our todo list for a while. The leading idea currently is to provide compiler selectable function variants that don't fall into the multiple precision abyss. The biggest challenge here is to come up with data about accuracy of results without the mp path.

nullc · on Jan 11, 2015

For single precision single argument functions you can exhaustively test to obtain data about the accuracy of the results.

derf_ · on Jan 4, 2015

That's good to hear.

cschwan · on Jan 3, 2015

Did anyone find the authors commits in glibc's repository? Would be nice to know what he really changed and when.

siddhesh · on Jan 4, 2015

These are all the changes I made in math during 2.18. They're almost all mp stuff with some code cleanup thrown in.

git log --author=siddhesh release/2.18/master -- sysdeps/ieee754/dbl-64

MichaelCrawford · on Jan 3, 2015

I am tinkering around with making some library code faster. I don't have definite results yet but I expect to soon.

Consider the comment in glibc's qsort that says something like "this works well on a Sun 4/260". That machine is about 25 years old.

When I have something to report of course HN will be the first to know.

My main focus is the reduction of energy consumption. Often but not always that can be done by making the code faster.

revelation · on Jan 3, 2015

No wonder people don't like doing floating-point calculations if theres a chance with certain input to have your pow-call fallthrough into a 1000x slower code path.

I mean, that should probably be an option. It's realloc all over, plain unusable because of the rare chance the function goes mad and copies data instead of failing.

pascal_cuoq · on Jan 3, 2015

Do you have a reference for the claim that realloc() is unusable for the reason you say?

Many users of floating-point are not using floating-point in hard real-time contexts. Even soft real-time contexts can be fine with Ziv-style successive approximations. Correctly rounded library functions have intangible benefits beyond precision (portability, for instance), but if you don't value these benefits, you can always use inaccurate and always fast functions. You don't even need to invent your own: just rip off the first stage of a multi-stage Ziv implementation.

I don't think any of this is the reason why “people don't like doing floating-point calculations”.

GFK_of_xmaspast · on Jan 3, 2015

Why are you blaming this on ~~~floating point~~~ instead of glibc?