Floating point oddity (johndcook.com)
122 points by furcyd on Nov 12, 2019 | 50 comments



> Incidentally, bc -l gives the right result, no matter how small you set scale.

There's an easy explanation here: bc has its own arbitrary precision number library embedded in it, so it doesn't make use of floats, doubles or even quads at all.

The number library is surprisingly readable [1] for such a complex topic, and doesn't use complex types at all, relying on ints and chars and pointers inside the struct it uses to store the numbers.

It's been wrapped up for a couple of languages; I stumbled over it via a Lua wrapper [0]. Useful when you want better precision than you currently have, but don't want to drag in something like GMP and are okay with minor edge cases sometimes appearing.

[0] https://webserver2.tecgraf.puc-rio.br/~lhf/ftp/lua/#lbc

[1] src/number.c & src/number.h from https://www.gnu.org/software/bc/


This has nothing to do with C vs. C++. It's a fundamental limitation of floating point arithmetic. Here is the Taylor series of \exp:

\exp(x) = 1 + x + higher order terms.

Sometimes x is exactly representable as a floating-point number, yet 1 + x rounds back to 1, because x is smaller than the spacing of representable values around 1. Under round-to-nearest this happens for positive x on the interval

(0, std::numeric_limits<T>::epsilon() / 2]
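
A minimal illustration of that point (the test value below is my own choice, not from the thread): adding anything well below machine epsilon to 1.0 just rounds back to 1.0.

    #include <cfloat>
    #include <cstdio>

    int main() {
        double x = DBL_EPSILON / 4;               // well below epsilon (~2.2e-16)
        std::printf("%d\n", 1.0 + x == 1.0);      // prints 1: the sum rounds back to 1.0
    }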


I thought there was a difference between a C and C++ implementation, but it was an error on my part. When I fixed the bug, I changed the post title. But the post hit HN while I was in the process of fixing it.


How does expm1 not work for you? It works for me.


tl;dr: expm1 does work. Something funny is going on if it doesn't work for you.

~~~

How this works out in practice is that once you try your hand at some serious floating-point, you quickly learn that an expression like x == 0 (or even like x <= 0) is a strong code smell. It's an instant red flag that you need to drop what you're doing and think things through extremely carefully first. So you'd stop right there and try to get some intuition for what's going on.

Now the intuition you want here is that floating point is logarithmically spaced, dense around 0 but not around 1, so if you evaluate exp(x) for small x, there isn't enough resolution near 1 to return anything but 1 itself: the gap to the next representable value above 1 is far larger than x. So when you subtract 1, you get zero.

But... that's precisely why expm1 exists: it's to evaluate exp(x) - 1 accurately. And in fact, it should work. And it does work for me. Why doesn't it work here? It sounds like either a bad implementation or the wrong compiler flags. We really need to know the compiler flags that are being used.
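
A tiny check of both claims (the example value is mine): for small x, exp(x) has already rounded to 1 before the subtraction, while expm1 keeps the leading behaviour.

    #include <cmath>
    #include <cstdio>

    int main() {
        double x = 1e-20;
        std::printf("%g\n", std::exp(x) - 1.0);   // 0: exp(x) rounds to exactly 1
        std::printf("%g\n", std::expm1(x));       // ~1e-20: no cancellation
    }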

For what it's worth: floating-point can be a lot worse than this, where there's no simple red flag or library function like this that you can recognize. In my experience, a rule of thumb (somewhat of a yellow flag) is that, if you start doing nonlinear operations (even division), you gotta roll up your sleeves and get ready for some fun...


> once you try your hand at some serious floating-point, you quickly learn that an expression like x == 0 (or even like x <= 0) is a strong code smell.

Yes; specifically, here it's this comparison, which surely should not be written this way:

    template <typename T> T e(T x) {
      return x == 0. ? 1. : (exp(x) - 1.)/x;
    }
and also:

https://en.cppreference.com/w/cpp/numeric/math/expm1


> x == 0 (or even like x <= 0) is a strong code smell.

Careful: 0.0f is defined by the standard to have the bit pattern 0x0, so such comparisons can be quite reasonable.

However, testing for 0 as the result of a computation, as in this example, is indeed problematic.

But there is plenty of reasonable code that initializes a float or double variable to 0.0 and can expect to be able to inspect it and find its bit pattern to be 0x0 before that variable is manipulated. After that, all bets are off.


The point isn't just that you can't tell exactly when something is zero, although indeed you often can't.

It's that if you have some calculation that needs to be special-cased at zero (e.g., because you're dividing by something and it might explode) then it probably actually needs to be special-cased near zero (because near zero it might overflow or run into other numerical problems), and it's probably worth seeing if there's another way of organizing the calculation that doesn't misbehave for small values in the first place.
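
As a concrete sketch of that last option (my own rewrite of the thread's e(), not anything from the post): expm1(x)/x is accurate for every nonzero x, so the only special case left is x == 0 itself, where the limit is 1.

    #include <cmath>

    template <typename T> T e_via_expm1(T x) {
        // expm1 removes the cancellation, so no "near zero" band is needed
        return x == T(0) ? T(1) : std::expm1(x) / x;
    }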


>I tried replacing exp(x) - 1 with expm1(x). This made no difference.

Interestingly, in Rust, all four q(x) are non-zero, using f32::exp / f64::exp in e(x) consistently gives 0, and using f32::exp_m1 / f64::exp_m1 consistently gives 1. (Given non-zero q(x), this is to be expected.) Both f32::exp_m1 and f64::exp_m1 are implemented using libm's expm1 [1], so I wonder what the author's C++ compiler is doing instead.

https://play.rust-lang.org/?version=stable&mode=release&edit...

[1]: https://doc.rust-lang.org/src/std/f64.rs.html#724-726 + https://github.com/rust-lang/rust/blob/bc0e288ad02ef362b5a6c...

Edit: Actually, I can't reproduce the author's situation on godbolt. The expm1 method consistently returns 1. https://gcc.godbolt.org/z/RvsdHU


There is an interesting paper that can automatically rewrite expressions to improve the floating point errors.

http://herbie.uwplse.org/


This looks... too good to be true. And indeed, I just tested it out on the quadratic formula, and the result doesn't seem to be terribly useful? At least in light of the problems that actually come up when trying to use the quadratic formula. (Which btw is quite a fun exercise to figure out if this doesn't ring a bell.)
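
In case it doesn't ring a bell: the classic trouble is catastrophic cancellation in -b + sqrt(b*b - 4*a*c) when b*b >> 4*a*c, and the usual fix computes one root in the stable direction and recovers the other from the product of the roots. A rough sketch with coefficients I made up:

    #include <cmath>
    #include <cstdio>

    int main() {
        double a = 1.0, b = 1e8, c = 1.0;          // roots near -1e8 and -1e-8
        double d = std::sqrt(b * b - 4.0 * a * c);
        double naive  = (-b + d) / (2.0 * a);      // cancellation: barely any correct digits
        double q      = -(b + std::copysign(d, b)) / 2.0;
        double stable = c / q;                     // same root, via x1 * x2 = c / a
        std::printf("naive  %.17g\nstable %.17g\n", naive, stable);
    }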


I don't want to sound dismissive, but isn't this just a problem with numerical instability? How is this different from trying to solve the following system?

  x^2+(2+y)*x+1=2
  x^2+(2 -y)*x+1=3
And playing with the value of y.

I know that this example involves a sign change while the OP's doesn't, but it's essentially the same kind of problem, isn't it?

If I have learnt anything from building PID loops on microcontrollers, it is that floating point can be a PITA when you are near these kinds of situations.


> isn't this just a problem with numerical instability?

Yes, but many developers aren't aware of numeric instability. Or, maybe they've heard the term, but not how it applies in practice.

People seem to go through stages when learning about floating point:

1. "When you need fractions, use floating point." At this stage, they expect "x / 3 * 3 == x" to always be true, and are surprised when things like that fail.

2. "Floating point is just approximate. So always use double, which is accurate to 15 decimal places." So this explains why you don't get exact answers, but of course doesn't explain the numerical instability in the OP's post.

3. "Cancellation can cause results to be arbitrarily far off! We need to analyze the error propagation of every expression!" This is impractical, and developers in stage 2 look at you like you're over engineering the system.

4. "In practice, using doubles 'just works' in almost all situations." If you're lucky, you can recognize when it might not work, and do error analysis only in those cases. If not, you just don't sleep well at night, wondering whether or where your code will break down. Probably some of both.


I learned about this back in the early 80s, writing a transcendental runtime library (to be used on the F-16 fighter with Jovial). I was always afraid I'd read a story about some pilot in a dogfight launching a missile, having it go nuts, and getting shot down because the code had issues.


I tried this with Julia and got the answer 1.

    e(x) = x == 0 ? 1 : (exp(x) - 1) / x
    q(x) = abs(x - sqrt(x^2 + 1)) - 1 / (x + sqrt(x^2 + 1))
    h(x) = e(q(x) ^ 2)
What's happening here?

Is Julia switching to arbitrary precision like bc -l, rewriting exp(x) - 1 to expm1(x), or doing something else? Seems like a lot of magic to me.


Not sure what’s going on but Julia is definitely not switching to arbitrary precision without being explicitly asked to do so. That would be very unjulian.

I’m not at a computer or I’d investigate. One hypothesis based on some other comments is that getting the wrong answer (especially when using expm1) may be due to unsafe C compiler flags, which Julia does not apply by default. Just a guess though.

Update from reading more comments: it seems that Rust, and C via godbolt without questionable flags, also don't fail like this. So I wonder whether John compiled the code with fast-math on, which allows the compiler to give all kinds of wrong answers.


I am using homebrew cask julia version 1.1.0 on macOS 10.13.6.


I get 0 with Julia 1.1.1 on Ubuntu:

  julia> e(x) = x == 0 ? 1 : (exp(x) - 1) / x 
  e (generic function with 1 method)

  julia> q(x) = abs(x - sqrt(x^2 + 1)) - 1 / (x + sqrt(x^2 + 1))
  q (generic function with 1 method)

  julia> h(x) = e(q(x) ^ 2)
  h (generic function with 1 method)

  julia> for x in (15., 16., 17., 9999.)
             println(h(x))
         end
  0.0
  0.0
  0.0
  0.0
I get the same in Python. They're all using standard hardware 64 bit floating point numbers in the same way.


One thing I notice about the C program is that it uses the double-precision exp(), fabs(), and sqrt(), rather than the "q"-suffixed variants from quadmath.h, a feature that appears to be specific to GCC. However, changing the program had no effect for me with GCC 9.2.
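
For anyone curious what the q-suffixed route looks like, a hedged sketch (GCC-specific, link with -lquadmath; the function names are as I remember them from quadmath.h, so double-check against the header):

    #include <quadmath.h>
    #include <cstdio>

    static __float128 q(__float128 x) {
        __float128 s = sqrtq(x * x + 1.0Q);
        return fabsq(x - s) - 1.0Q / (x + s);
    }

    int main() {
        char buf[128];
        quadmath_snprintf(buf, sizeof buf, "%.30Qg", q(500.0Q));
        std::printf("%s\n", buf);
    }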

I can't get the C++ program to compile.

Edit: the post has been rewritten since I saw it.


There are many posts about floating point limitations and this is the most absurd one.

Yes, floating point represents irrational numbers as approximations, and each operation can deviate further from the real value you would expect.

He could just say "`float a = M_PI;` is not actually equal to pi!"


The post no longer exists?



The original post was missing return types for the C++ functions, thus defaulting to int. Adding correct return types (T) resulted in the same output as the C version (0), thus eliminating the discrepancy and making the title nonsensical, hence the edit.

-Wall -Wextra -Werror is your friend.


Since floating point operations are not associative, I wonder if it is legal for compilers to rearrange them. Maybe that’s what’s happening here?


Even if it is legal to rearrange them, the compiler should not unless it is guaranteed to get the same result.


As far as I know, floating point add and multiply are commutative (but not associative) in IEEE 754. And I believe C and C++ allow you to detect if IEEE 754 is used by the hardware and if so, execute the floating point expressions as written.

Note: That is of course no longer true under -ffast-math.


> As far as I know, floating point add and multiply are commutative (but not associative) in IEEE 754.

Usually, but not always. (+0) + (-0) yields (+0), while (-0) + (+0) yields (-0).


In IEEE 754-1985 and IEEE 754-2008 this is not valid behavior. I have mostly worked on the 2008 revision, so I didn't know if 0 was commutative in the original standard. Thanks for making me look it up.

IEEE 754-2008

> Section 6.3 The sign bit

> When the sum of two operands with opposite signs (or the difference of two operands with like signs) is exactly zero, the sign of that sum (or difference) shall be +0 in all rounding-direction attributes except roundTowardNegative; under that attribute, the sign of an exact zero sum (or difference) shall be −0. However, x + x = x − (−x) retains the same sign as x even when x is zero.

No quote for IEEE 754-1985 because I didn't want to type it out and it is mostly the same.


Huh, interesting. I've definitely seen systems where the sum of two zeroes took the sign of the first value. I assumed they were inheriting 754 behaviour but I guess not.


I can't reproduce this: https://godbolt.org/z/PK4gvZ


It works if the compiler can't do constant folding, like so: https://godbolt.org/z/nQDrSW


Your first line of main is adding -0.0 and -0.0, which is not what I think you intended to show.


Oops, you're right. https://godbolt.org/z/pAwqUA shows it better. Adding a -0 and +0 together in either order gives +0, using the actual x86-64 addsd instruction.
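
A quick local version of the same check (volatile only so the sums aren't folded away at compile time, per the constant-folding point above; the variable names are mine):

    #include <cmath>
    #include <cstdio>

    int main() {
        volatile double pz = 0.0, nz = -0.0;
        std::printf("%d %d\n", std::signbit(pz + nz), std::signbit(nz + pz));  // 0 0: +0 in either order
        std::printf("%d\n", std::signbit(nz + nz));                            // 1: -0 + -0 keeps the sign
    }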


Can you cite any references about this? I don’t recall anything like this at all unless perhaps it’s specific to FENV_ACCESS and non-default rounding modes.


In a sister comment, I quoted IEEE 754-2008 Section 6.3 which shows that 0 is commutative in the newer standard. The older revision has similar language.


Oops, I meant associative. Sorry about that; I fixed it above.


This post does not seem to compare C++ and C, but rather C++ and the bc arbitrary-precision calculator?


He edited it after fixing an error in the C++ version. (See my comment elsewhere in this thread.)


Right. Sorry for the confusion. The post hit HN while I was editing.


@dang please rename title to "Floating point oddity" as in the linked post. It's not about C vs C++.


    template <typename T> T q(T x) {
      return fabs(x - sqrt(x*x + 1.)) - 1./(x + sqrt(x*x + 1.));
    }
The author doesn't seem to have a lot of numeric experience. He probably doesn't know that, even when T is float, the compiler uses the double version of sqrt.

Don't believe me? Read the assembler:

    0000000000000c3c <float q<float>(float)>:
     c3c:   55                      push   rbp
     c3d:   48 89 e5                mov    rbp,rsp
     c40:   48 83 ec 20             sub    rsp,0x20
     c44:   f3 0f 11 45 fc          movss  DWORD PTR [rbp-0x4],xmm0
     c49:   f3 0f 5a 5d fc          cvtss2sd xmm3,DWORD PTR [rbp-0x4]
     c4e:   f2 0f 11 5d f0          movsd  QWORD PTR [rbp-0x10],xmm3
     c53:   f3 0f 10 45 fc          movss  xmm0,DWORD PTR [rbp-0x4]
     c58:   f3 0f 59 45 fc          mulss  xmm0,DWORD PTR [rbp-0x4]
     c5d:   f3 0f 5a c0             cvtss2sd xmm0,xmm0
     c61:   f2 0f 10 0d 4f 02 00    movsd  xmm1,QWORD PTR [rip+0x24f]        # eb8 <std::piecewise_construct+0x8>
     c68:   00
     c69:   f2 0f 58 c1             addsd  xmm0,xmm1
     c6d:   e8 5e fc ff ff          call   8d0 <sqrt@plt>
     c72:   f2 0f 10 5d f0          movsd  xmm3,QWORD PTR [rbp-0x10]
     c77:   f2 0f 5c d8             subsd  xmm3,xmm0
     c7b:   66 0f 28 c3             movapd xmm0,xmm3
     c7f:   f3 0f 7e 0d 39 02 00    movq   xmm1,QWORD PTR [rip+0x239]        # ec0 <std::piecewise_construct+0x10>
     c86:   00
     c87:   66 0f 54 c1             andpd  xmm0,xmm1
     c8b:   f2 0f 11 45 f0          movsd  QWORD PTR [rbp-0x10],xmm0
     c90:   f3 0f 5a 65 fc          cvtss2sd xmm4,DWORD PTR [rbp-0x4]
     c95:   f2 0f 11 65 e8          movsd  QWORD PTR [rbp-0x18],xmm4
     c9a:   f3 0f 10 45 fc          movss  xmm0,DWORD PTR [rbp-0x4]
     c9f:   f3 0f 59 45 fc          mulss  xmm0,DWORD PTR [rbp-0x4]
     ca4:   f3 0f 5a c0             cvtss2sd xmm0,xmm0
     ca8:   f2 0f 10 0d 08 02 00    movsd  xmm1,QWORD PTR [rip+0x208]        # eb8 <std::piecewise_construct+0x8>
     caf:   00
     cb0:   f2 0f 58 c1             addsd  xmm0,xmm1
     cb4:   e8 17 fc ff ff          call   8d0 <sqrt@plt>
     cb9:   f2 0f 58 45 e8          addsd  xmm0,QWORD PTR [rbp-0x18]
     cbe:   f2 0f 10 0d f2 01 00    movsd  xmm1,QWORD PTR [rip+0x1f2]        # eb8 <std::piecewise_construct+0x8>
     cc5:   00
     cc6:   f2 0f 5e c8             divsd  xmm1,xmm0
     cca:   66 0f 28 c1             movapd xmm0,xmm1
     cce:   f2 0f 10 55 f0          movsd  xmm2,QWORD PTR [rbp-0x10]
     cd3:   f2 0f 5c d0             subsd  xmm2,xmm0
     cd7:   66 0f 28 c2             movapd xmm0,xmm2
     cdb:   f2 0f 5a c0             cvtsd2ss xmm0,xmm0
     cdf:   c9                      leave
     ce0:   c3                      ret
And all you need to know is that the call instruction is calling sqrt@plt instead of sqrtf@plt.

And if you don't believe the compiler either, I'll quote this sentence from the C++11 standard, section 2.14.4 [lex.fcon]:

> The type of a floating literal is double unless explicitly specified by a suffix. The suffixes f and F specify float, the suffixes l and L specify long double.

So if you do `sqrt(x*x + 1.)` the right operand of plus is double, therefore the argument to sqrt is double, and you get the double version.
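
For what it's worth, a sketch of one way to keep the whole computation in T (my own rewrite, not the article's code): call the std:: overloads and write the constant as T(1), so that the float instantiation really does use the float sqrt.

    #include <cmath>

    template <typename T> T q(T x) {
        T s = std::sqrt(x * x + T(1));    // std::sqrt has float/double/long double overloads
        return std::fabs(x - s) - T(1) / (x + s);
    }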


Just using templates with floating point numerical computation is a sure red flag. One has to adjust the code to the limits imposed by each internal representation.

Floating point code is not to be written without figuring out much more than copying some math formula from somewhere.

In this case the article perhaps attempts to demonstrate that fact, but the discussion of what's actually happening is surely lacking.


> Just using templates with floating point numerical computation is a sure red flag.

What do you think of Eigen, then? It seems to do pretty well, and is very template heavy.


I’m not claiming it’s impossible in general, just that it’s much more probable that the limits have to be defined explicitly for different sizes. Template specializations etc.


> The author doesn't seem to have a lot of numeric experience.

What a strange statement. Even if you haven't seen John D Cook's articles on HN before [1], it's just a couple of clicks to find out more about him and what he does, e.g.

https://www.johndcook.com/blog/applied-math/

https://www.johndcook.com/blog/notes/#numerical

[1] There are a lot! https://hn.algolia.com/?dateRange=all&page=0&prefix=false&qu...


Not strange at all. Looking at the links you posted, I suppose the author probably has an immense understanding of the theoretical aspects of numeric computing, but still utterly lacks the experience of a down-to-earth engineer churning out working code fast.


That's true, but it has nothing to do with the main point of the post, which is that that implementation of the q function will almost always give bad results.


somewhat related question: how do I most easily verify / disprove that

(a+b)-b == a exactly

for a,b floating point numbers?

(i.e. I don't want the compiler to optimize the LHS away to a...)


You could prevent the compiler from optimising it by either running with optimisations turned off or by declaring a and b as volatile. Declaring a variable as volatile tells the compiler that something external to the program (and indeed even external to the processor or memory!) might change the value of that variable at any time. In this case that's not actually true but it should stop the compiler from optimising the variable access in any case.

The easiest case to disprove your proposed invariant is to set a to 0 and b to NaN. The output is NaN and not 0.
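
A minimal version of that volatile approach, with a finite counterexample of my own alongside the NaN one:

    #include <cstdio>

    int main() {
        volatile double a = 1.0, b = 1e16;
        double lhs = (a + b) - b;            // a is absorbed: a + b rounds to exactly 1e16
        std::printf("%d\n", lhs == a);       // 0 on IEEE 754 doubles
    }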


I think it is well known that floating-point numbers have their limitations. How would the IEEE 754 floating-point system know when to use L'Hôpital's rule? If you know there is a singularity (0 / 0 or inf / inf) then you don't have to use floating-point; use calculus instead.

Edit: What I want to say is that this is not a problem with floating-point numbers (nor with C or C++). It works _as intended_. Trying to increase the precision for evaluating (exp(x) - 1) / x at x = 0 is beating a dead horse. Instead, a robust approach would be to expand (exp(x) - 1) / x around x = 0 as a Taylor series for very small x:

1 + x/2 + x^2/6 + x^3/24 + x^4/120 + O(x^5)
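
A rough sketch of that approach (the cutoff and the number of series terms below are my own guesses, not tuned): use the truncated series for tiny |x| and fall back to expm1(x)/x elsewhere.

    #include <cmath>

    double e_series(double x) {
        if (std::fabs(x) < 1e-4) {                               // assumed cutoff
            return 1.0 + x / 2.0 + x * x / 6.0 + x * x * x / 24.0;
        }
        return std::expm1(x) / x;                                // accurate for moderate x
    }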



