Floating point oddity (johndcook.com)
122 points by furcyd on Nov 12, 2019 | 50 comments



> Incidentally, bc -l gives the right result, no matter how small you set scale.

There's an easy explanation here: bc has its own arbitrary precision number library embedded in it, so it doesn't make use of floats, doubles or even quads at all.

The number library is surprisingly readable [1] for such a complex topic, and doesn't use complex types at all, relying on ints and chars and pointers inside the struct it uses to store the numbers.

It's been wrapped up for a couple of languages; I stumbled over it via a Lua wrapper [0]. Useful when you want better precision than you currently have, but don't want to drag in something like GMP and are okay with minor edge cases sometimes appearing.

[0] https://webserver2.tecgraf.puc-rio.br/~lhf/ftp/lua/#lbc

[1] src/number.c & src/number.h from https://www.gnu.org/software/bc/


This has nothing to do with C vs. C++. It's a fundamental limitation of floating point arithmetic. Here is the Taylor series of \exp:

\exp(x) = 1 + x + higher order terms.

Sometimes x is exactly representable as a floating-point number, yet 1 + x rounds back to 1, because x is smaller than the spacing of representable values around 1. Under round-to-nearest this happens for positive x on the interval

(0, std::numeric_limits<T>::epsilon() / 2]
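
A minimal illustration of that point (the test value below is my own choice, not from the thread): adding anything well below machine epsilon to 1.0 just rounds back to 1.0.

    #include <cfloat>
    #include <cstdio>

    int main() {
        double x = DBL_EPSILON / 4;               // well below epsilon (~2.2e-16)
        std::printf("%d\n", 1.0 + x == 1.0);      // prints 1: the sum rounds back to 1.0
    }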


I thought there was a difference between a C and C++ implementation, but it was an error on my part. When I fixed the bug, I changed the post title. But the post hit HN while I was in the process of fixing it.


How does expm1 not work for you? It works for me.


tl;dr: expm1 does work. Something funny is going on if it doesn't work for you.

~~~

How this works out in practice is that once you try your hand at some serious floating-point, you quickly learn that an expression like x == 0 (or even like x <= 0) is a strong code smell. It's an instant red flag that you need to drop what you're doing and think things through extremely carefully first. So you'd stop right there and try to get some intuition for what's going on.

Now the intuition you want here is that floating point is logarithmically spaced, dense around 0 but not around 1, so if you evaluate exp(x) for small x, there isn't enough resolution near 1 to return anything but 1 itself: the gap to the next representable value above 1 is far larger than x. So when you subtract 1, you get zero.

But... that's precisely why expm1 exists: it's to evaluate exp(x) - 1 accurately. And in fact, it should work. And it does work for me. Why doesn't it work here? It sounds like either a bad implementation or the wrong compiler flags. We really need to know the compiler flags that are being used.
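
A tiny check of both claims (the example value is mine): for small x, exp(x) has already rounded to 1 before the subtraction, while expm1 keeps the leading behaviour.

    #include <cmath>
    #include <cstdio>

    int main() {
        double x = 1e-20;
        std::printf("%g\n", std::exp(x) - 1.0);   // 0: exp(x) rounds to exactly 1
        std::printf("%g\n", std::expm1(x));       // ~1e-20: no cancellation
    }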

For what it's worth: floating-point can be a lot worse than this, where there's no simple red flag or library function like this that you can recognize. In my experience, a rule of thumb (somewhat of a yellow flag) is that, if you start doing nonlinear operations (even division), you gotta roll up your sleeves and get ready for some fun...


> once you try your hand at some serious floating-point, you quickly learn that an expression like x == 0 (or even like x <= 0) is a strong code smell.

Yes; specifically, here it's this comparison, which surely should not be written this way:

    template <typename T> T e(T x) {
      return x == 0. ? 1. : (exp(x) - 1.)/x;
    }
and also:

https://en.cppreference.com/w/cpp/numeric/math/expm1


> x == 0 (or even like x <= 0) is a strong code smell.

Careful: 0.0f is defined by the standard to have the bit pattern 0x0, so such comparisons can be quite reasonable.

However, testing for 0 as the result of a computation, as in this example, is indeed problematic.

But there is plenty of reasonable code that initializes a float or double variable to 0.0 and can expect to be able to inspect it and find its bit pattern to be 0x0 before that variable is manipulated. After that, all bets are off.


The point isn't just that you can't tell exactly when something is zero, although indeed you often can't.

It's that if you have some calculation that needs to be special-cased at zero (e.g., because you're dividing by something and it might explode) then it probably actually needs to be special-cased near zero (because near zero it might overflow or run into other numerical problems), and it's probably worth seeing if there's another way of organizing the calculation that doesn't misbehave for small values in the first place.
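
As a concrete sketch of that last option (my own rewrite of the thread's e(), not anything from the post): expm1(x)/x is accurate for every nonzero x, so the only special case left is x == 0 itself, where the limit is 1.

    #include <cmath>

    template <typename T> T e_via_expm1(T x) {
        // expm1 removes the cancellation, so no "near zero" band is needed
        return x == T(0) ? T(1) : std::expm1(x) / x;
    }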


>I tried replacing exp(x) - 1 with expm1(x). This made no difference.

Interestingly, in Rust, all four q(x) are non-zero, using f32::exp / f64::exp in e(x) consistently gives 0, and using f32::exp_m1 / f64::exp_m1 consistently gives 1. (Given non-zero q(x), this is to be expected.) Both f32::exp_m1 and f64::exp_m1 are implemented using libm's expm1 [1], so I wonder what the author's C++ compiler is doing instead.

https://play.rust-lang.org/?version=stable&mode=release&edit...

[1]: https://doc.rust-lang.org/src/std/f64.rs.html#724-726 + https://github.com/rust-lang/rust/blob/bc0e288ad02ef362b5a6c...

Edit: Actually, I can't reproduce the author's situation on godbolt. The expm1 method consistently returns 1. https://gcc.godbolt.org/z/RvsdHU


There is an interesting paper that can automatically rewrite expressions to improve the floating point errors.

http://herbie.uwplse.org/


This looks... too good to be true. And indeed, I just tested it out on the quadratic formula, and the result doesn't seem to be terribly useful? At least in light of the problems that actually come up when trying to use the quadratic formula. (Which btw is quite a fun exercise to figure out if this doesn't ring a bell.)
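
In case it doesn't ring a bell: the classic trouble is catastrophic cancellation in -b + sqrt(b*b - 4*a*c) when b*b >> 4*a*c, and the usual fix computes one root in the stable direction and recovers the other from the product of the roots. A rough sketch with coefficients I made up:

    #include <cmath>
    #include <cstdio>

    int main() {
        double a = 1.0, b = 1e8, c = 1.0;          // roots near -1e8 and -1e-8
        double d = std::sqrt(b * b - 4.0 * a * c);
        double naive  = (-b + d) / (2.0 * a);      // cancellation: barely any correct digits
        double q      = -(b + std::copysign(d, b)) / 2.0;
        double stable = c / q;                     // same root, via x1 * x2 = c / a
        std::printf("naive  %.17g\nstable %.17g\n", naive, stable);
    }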


I don't want to sound dismissive, but isn't this just a problem with numerical instability? How is this different from trying to solve the following system?

  x^2+(2+y)*x+1=2
  x^2+(2 -y)*x+1=3
And playing with the value of y.

I know that this example involves a sign change while the OP's doesn't, but it's essentially the same kind of problem, isn't it?

If I have learnt anything from building PID loops on microcontrollers, it is that floating point can be a PITA when you are near these kinds of situations.


> isn't this just a problem with numerical instability?

Yes, but many developers aren't aware of numeric instability. Or, maybe they've heard the term, but not how it applies in practice.

People seem to go through stages when learning about floating point:

1. "When you need fractions, use floating point." At this stage, they expect "x / 3 * 3 == x" to always be true, and are surprised when things like that fail.

2. "Floating point is just approximate. So always use double, which is accurate to 15 decimal places." So this explains why you don't get exact answers, but of course doesn't explain the numerical instability in the OP's post.

3. "Cancellation can cause results to be arbitrarily far off! We need to analyze the error propagation of every expression!" This is impractical, and developers in stage 2 look at you like you're over engineering the system.

4. "In practice, using doubles 'just works' in almost all situations." If you're lucky, you can recognize when it might not work, and do error analysis only in those cases. If not, you just don't sleep well at night, wondering whether or where your code will break down. Probably some of both.


I learned about this back in the early 80s, writing a transcendental runtime library (to be used on the F-16 fighter with Jovial). I was always afraid I'd read a story about some pilot in a dogfight launching a missile, having it go nuts, and getting shot down because the code had issues.


I tried this with Julia and got the answer 1.

    e(x) = x == 0 ? 1 : (exp(x) - 1) / x
    q(x) = abs(x - sqrt(x^2 + 1)) - 1 / (x + sqrt(x^2 + 1))
    h(x) = e(q(x) ^ 2)
What's happening here?

Is Julia switching to arbitrary precision like bc -l, rewriting exp(x) - 1 to expm1(x), or doing something else? Seems like a lot of magic to me.


Not sure what’s going on but Julia is definitely not switching to arbitrary precision without being explicitly asked to do so. That would be very unjulian.

I’m not at a computer or I’d investigate. One hypothesis based on some other comments is that getting the wrong answer (especially when using expm1) may be due to unsafe C compiler flags, which Julia does not apply by default. Just a guess though.

Update from reading more comments: it seems that Rust, and C via godbolt without questionable flags, also don't fail like this. So I wonder whether John compiled the code with fast-math on, which allows the compiler to give all kinds of wrong answers.


I am using homebrew cask julia version 1.1.0 on macOS 10.13.6.


I get 0 with Julia 1.1.1 on Ubuntu:

  julia> e(x) = x == 0 ? 1 : (exp(x) - 1) / x 
  e (generic function with 1 method)

  julia> q(x) = abs(x - sqrt(x^2 + 1)) - 1 / (x + sqrt(x^2 + 1))
  q (generic function with 1 method)

  julia> h(x) = e(q(x) ^ 2)
  h (generic function with 1 method)

  julia> for x in (15., 16., 17., 9999.)
             println(h(x))
         end
  0.0
  0.0
  0.0
  0.0
I get the same in Python. They're all using standard hardware 64 bit floating point numbers in the same way.


One thing I notice about the C program is that it uses the double-precision exp(), fabs(), and sqrt(), rather than the "q"-suffixed variants from quadmath.h, a feature that appears to be specific to GCC. However, changing the program had no effect for me with GCC 9.2.
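
For anyone curious what the q-suffixed route looks like, a hedged sketch (GCC-specific, link with -lquadmath; the function names are as I remember them from quadmath.h, so double-check against the header):

    #include <quadmath.h>
    #include <cstdio>

    static __float128 q(__float128 x) {
        __float128 s = sqrtq(x * x + 1.0Q);
        return fabsq(x - s) - 1.0Q / (x + s);
    }

    int main() {
        char buf[128];
        quadmath_snprintf(buf, sizeof buf, "%.30Qg", q(500.0Q));
        std::printf("%s\n", buf);
    }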

I can't get the C++ program to compile.

Edit: the post has been rewritten since I saw it.


There are many posts about floating point limitations and this is the most absurd one.

Yes, floating point represents irrational numbers as approximations, and each operation can deviate further from the real value you would expect.

He could just say "`float a = M_PI;` is not actually equal to pi!"


The post no longer exists?



The original post was missing return types for the C++ functions, thus defaulting to int. Adding correct return types (T) resulted in the same output as the C version (0), thus eliminating the discrepancy and making the title nonsensical, hence the edit.

-Wall -Wextra -Werror is your friend.


Since floating point operations are not associative, I wonder if it is legal for compilers to rearrange them. Maybe that’s what’s happening here?


Even if it is legal to rearrange them, the compiler should not unless it is guaranteed to get the same result.


As far as I know, floating point add and multiply are commutative (but not associative) in IEEE 754. And I believe C and C++ allow you to detect if IEEE 754 is used by the hardware and if so, execute the floating point expressions as written.

Note: That is of course no longer true under -ffast-math.


> As far as I know, floating point add and multiply are commutative (but not associative) in IEEE 754.

Usually, but not always. (+0) + (-0) yields (+0), while (-0) + (+0) yields (-0).


In IEEE 754-1985 and IEEE 754-2008 this is not valid behavior. I have mostly worked on the 2008 revision, so I didn't know if 0 was commutative in the original standard. Thanks for making me look it up.

IEEE 754-2008

> Section 6.3 The sign bit

> When the sum of two operands with opposite signs (or the difference of two operands with like signs) is exactly zero, the sign of that sum (or difference) shall be +0 in all rounding-direction attributes except roundTowardNegative; under that attribute, the sign of an exact zero sum (or difference) shall be −0. However, x + x = x − (−x) retains the same sign as x even when x is zero.

No quote for IEEE 754-1985 because I didn't want to type it out and it is mostly the same.


Huh, interesting. I've definitely seen systems where the sum of two zeroes took the sign of the first value. I assumed they were inheriting 754 behaviour but I guess not.


I can't reproduce this: https://godbolt.org/z/PK4gvZ


It works if the compiler can't do constant folding, like so: https://godbolt.org/z/nQDrSW


Your first line of main is adding -0.0 and -0.0, which is not what I think you intended to show.


Oops, you're right. https://godbolt.org/z/pAwqUA shows it better. Adding a -0 and +0 together in either order gives +0, using the actual x86-64 addsd instruction.
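
A quick local version of the same check (volatile only so the sums aren't folded away at compile time, per the constant-folding point above; the variable names are mine):

    #include <cmath>
    #include <cstdio>

    int main() {
        volatile double pz = 0.0, nz = -0.0;
        std::printf("%d %d\n", std::signbit(pz + nz), std::signbit(nz + pz));  // 0 0: +0 in either order
        std::printf("%d\n", std::signbit(nz + nz));                            // 1: -0 + -0 keeps the sign
    }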


Can you cite any references about this? I don’t recall anything like this at all unless perhaps it’s specific to FENV_ACCESS and non-default rounding modes.


In a sister comment, I quoted IEEE 754-2008 Section 6.3 which shows that 0 is commutative in the newer standard. The older revision has similar language.


Oops, I meant associative. Sorry about that; I fixed it above.


This post does not seem to compare C++ and C, but rather C++ and the bc arbitrary-precision calculator?


He edited it after fixing an error in the C++ version. (See my comment elsewhere in this thread.)


Right. Sorry for the confusion. The post hit HN while I was editing.


@dang please rename title to "Floating point oddity" as in the linked post. It's not about C vs C++.


    template <typename T> T q(T x) {
      return fabs(x - sqrt(x*x + 1.)) - 1./(x + sqrt(x*x + 1.));
    }
The author doesn't seem to have a lot of numeric experience. He probably doesn't know that, even when T is float, the compiler uses the double version of sqrt.

Don't believe me? Read the assembler:

    0000000000000c3c <float q<float>(float)>:
     c3c:   55                      push   rbp
     c3d:   48 89 e5                mov    rbp,rsp
     c40:   48 83 ec 20             sub    rsp,0x20
     c44:   f3 0f 11 45 fc          movss  DWORD PTR [rbp-0x4],xmm0
     c49:   f3 0f 5a 5d fc          cvtss2sd xmm3,DWORD PTR [rbp-0x4]
     c4e:   f2 0f 11 5d f0          movsd  QWORD PTR [rbp-0x10],xmm3
     c53:   f3 0f 10 45 fc          movss  xmm0,DWORD PTR [rbp-0x4]
     c58:   f3 0f 59 45 fc          mulss  xmm0,DWORD PTR [rbp-0x4]
     c5d:   f3 0f 5a c0             cvtss2sd xmm0,xmm0
     c61:   f2 0f 10 0d 4f 02 00    movsd  xmm1,QWORD PTR [rip+0x24f]        # eb8 <std::piecewise_construct+0x8>
     c68:   00
     c69:   f2 0f 58 c1             addsd  xmm0,xmm1
     c6d:   e8 5e fc ff ff          call   8d0 <sqrt@plt>
     c72:   f2 0f 10 5d f0          movsd  xmm3,QWORD PTR [rbp-0x10]
     c77:   f2 0f 5c d8             subsd  xmm3,xmm0
     c7b:   66 0f 28 c3             movapd xmm0,xmm3
     c7f:   f3 0f 7e 0d 39 02 00    movq   xmm1,QWORD PTR [rip+0x239]        # ec0 <std::piecewise_construct+0x10>
     c86:   00
     c87:   66 0f 54 c1             andpd  xmm0,xmm1
     c8b:   f2 0f 11 45 f0          movsd  QWORD PTR [rbp-0x10],xmm0
     c90:   f3 0f 5a 65 fc          cvtss2sd xmm4,DWORD PTR [rbp-0x4]
     c95:   f2 0f 11 65 e8          movsd  QWORD PTR [rbp-0x18],xmm4
     c9a:   f3 0f 10 45 fc          movss  xmm0,DWORD PTR [rbp-0x4]
     c9f:   f3 0f 59 45 fc          mulss  xmm0,DWORD PTR [rbp-0x4]
     ca4:   f3 0f 5a c0             cvtss2sd xmm0,xmm0
     ca8:   f2 0f 10 0d 08 02 00    movsd  xmm1,QWORD PTR [rip+0x208]        # eb8 <std::piecewise_construct+0x8>
     caf:   00
     cb0:   f2 0f 58 c1             addsd  xmm0,xmm1
     cb4:   e8 17 fc ff ff          call   8d0 <sqrt@plt>
     cb9:   f2 0f 58 45 e8          addsd  xmm0,QWORD PTR [rbp-0x18]
     cbe:   f2 0f 10 0d f2 01 00    movsd  xmm1,QWORD PTR [rip+0x1f2]        # eb8 <std::piecewise_construct+0x8>
     cc5:   00
     cc6:   f2 0f 5e c8             divsd  xmm1,xmm0
     cca:   66 0f 28 c1             movapd xmm0,xmm1
     cce:   f2 0f 10 55 f0          movsd  xmm2,QWORD PTR [rbp-0x10]
     cd3:   f2 0f 5c d0             subsd  xmm2,xmm0
     cd7:   66 0f 28 c2             movapd xmm0,xmm2
     cdb:   f2 0f 5a c0             cvtsd2ss xmm0,xmm0
     cdf:   c9                      leave
     ce0:   c3                      ret
And all you need to know is that the call instruction is calling sqrt@plt instead of sqrtf@plt.

And if you don't believe the compiler either, I'll quote this sentence from the C++11 standard, section 2.14.4 [lex.fcon]:

> The type of a floating literal is double unless explicitly specified by a suffix. The suffixes f and F specify float, the suffixes l and L specify long double.

So if you do `sqrt(x*x + 1.)` the right operand of plus is double, therefore the argument to sqrt is double, and you get the double version.
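
For what it's worth, a sketch of one way to keep the whole computation in T (my own rewrite, not the article's code): call the std:: overloads and write the constant as T(1), so that the float instantiation really does use the float sqrt.

    #include <cmath>

    template <typename T> T q(T x) {
        T s = std::sqrt(x * x + T(1));    // std::sqrt has float/double/long double overloads
        return std::fabs(x - s) - T(1) / (x + s);
    }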


Just using templates with floating point numerical computation is a sure red flag. One has to adjust the code to the limits imposed by each internal representation.

Floating point code is not to be written without figuring out much more than copying some math formula from somewhere.

In this case the article perhaps attempts to demonstrate that fact, but the discussion of what's actually happening is surely lacking.


> Just using templates with floating point numerical computation is a sure red flag.

What do you think of Eigen, then? It seems to do pretty well, and is very template heavy.


I’m not claiming it’s impossible in general, just that it’s much more probable that the limits have to be defined explicitly for different sizes. Template specializations etc.


> The author doesn't seem to have a lot of numeric experience.

What a strange statement. Even if you haven't seen John D Cook's articles on HN before [1], it's just a couple of clicks to find out more about him and what he does, e.g.

https://www.johndcook.com/blog/applied-math/

https://www.johndcook.com/blog/notes/#numerical

[1] There are a lot! https://hn.algolia.com/?dateRange=all&page=0&prefix=false&qu...


Not strange at all. Looking at the links you posted, I suppose the author probably has an immense understanding of the theoretical aspects of numeric computing, but still utterly lacks the experience of a down-to-earth engineer churning out working code fast.


That's true, but it has nothing to do with the main point of the post, which is that that implementation of the q function will almost always give bad results.


somewhat related question: how do I most easily verify / disprove that

(a+b)-b == a exactly

for a,b floating point numbers?

(i.e. I don't want the compiler to optimize the LHS away to a...)


You could prevent the compiler from optimising it by either running with optimisations turned off or by declaring a and b as volatile. Declaring a variable as volatile tells the compiler that something external to the program (and indeed even external to the processor or memory!) might change the value of that variable at any time. In this case that's not actually true but it should stop the compiler from optimising the variable access in any case.

The easiest case to disprove your proposed invariant is to set a to 0 and b to NaN. The output is NaN and not 0.
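
A minimal version of that volatile approach, with a finite counterexample of my own alongside the NaN one:

    #include <cstdio>

    int main() {
        volatile double a = 1.0, b = 1e16;
        double lhs = (a + b) - b;            // a is absorbed: a + b rounds to exactly 1e16
        std::printf("%d\n", lhs == a);       // 0 on IEEE 754 doubles
    }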


I think it is well known that floating-point numbers have their limitations. How would the IEEE 754 floating-point system know when to use L'Hôpital's rule? If you know there is a singularity (0 / 0 or inf / inf) then you don't have to use floating-point; use calculus instead.

Edit: What I want to say is that this is not a problem with floating-point numbers (nor with C or C++). It works _as intended_. Trying to increase the precision for evaluating (exp(x) - 1) / x at x = 0 is beating a dead horse. Instead, a robust approach would be to expand (exp(x) - 1) / x around x = 0 as a Taylor series for very small x:

1 + x/2 + x^2/6 + x^3/24 + x^4/120 + O(x^5)
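
A rough sketch of that approach (the cutoff and the number of series terms below are my own guesses, not tuned): use the truncated series for tiny |x| and fall back to expm1(x)/x elsewhere.

    #include <cmath>

    double e_series(double x) {
        if (std::fabs(x) < 1e-4) {                               // assumed cutoff
            return 1.0 + x / 2.0 + x * x / 6.0 + x * x * x / 24.0;
        }
        return std::expm1(x) / x;                                // accurate for moderate x
    }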



