Cryptic genetic variation in software: hunting a buffered 41-year-old bug (cryptogenomicon.wordpress.com)
175 points by mineo on Oct 15, 2014 | 47 comments



That was a really interesting read, and very well written. I wonder if anyone can clear this up though...

I find the terminology of open and closed intervals contradictory to their meaning. Does anyone know why they are described like this?

`Closed` makes me think shut or not-including - however it includes its endpoints. `Open` makes me think inclusive - yet does not include its endpoints.


From a real analysis standpoint... the definition of a 'closed set' in an N-dimensional metric space (of which Euclidean space, i.e. normal space, is an example) is as follows: a set C is 'closed' if and only if, given any sequence of elements (x_n) converging on x, such that (x_n) is a subset of C, it follows that x is also in C.

Under this definition, 'closed' makes sense in the larger context, since in mathematics, if an operation is 'closed' on a set, that means that applying the operation to elements of the set always yields another element of the same set.

Open is a bit more awkward. A set O is 'open' if and only if the complement of O (i.e. the set of all points not in O) is closed. So open is kind of the opposite of closed, hence the convention. Of course, it isn't really the opposite of closed, since there are sets which are neither closed nor open.

tt;dr (too technical; didn't read) in maths, 'closed' usually means 'I can do stuff inside this set without falling out of it'. In the case of intervals, the 'stuff' in question is taking a limit of a convergent sequence.


You can also have sets that are simultaneously closed and open.


Yep. For example, the empty set or the entire real number line.


Out of curiosity, do you have any examples or references to something with examples of sets that are neither open nor closed?


The set of rational numbers that lie inside the interval [0,1].

This set is not closed: there are non-rational numbers in that interval which are limit points of sequences consisting only of rationals, for example any of the irrational algebraic numbers such as sqrt(2)/2. In fact every real number is the limit of such a sequence, since the rationals are dense in the reals (this is essentially how Dedekind Cuts, one method for constructing the Reals, work).

This set is not open because any rational is the limit of a sequence of non-rational reals. This probably makes intuitive sense, but just for the sake of formality: to construct such a sequence for any rational r, start with the number x_1 = 1/pi, and close 1/pi of the remaining gap at each step, i.e. x_{n+1} = x_n + (r - x_n)/pi. Each x_n is irrational because pi is transcendental.
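A quick numerical sketch of that sequence in C (the target r = 1/2 and the step count are arbitrary choices of mine, not anything from the argument above):

    #include <stdio.h>
    #include <math.h>

    int main(void) {
        double r = 0.5;           /* the rational we converge to */
        double x = 1.0 / M_PI;    /* x_1, irrational */
        for (int n = 1; n <= 25; n++)
            x += (r - x) / M_PI;  /* x_{n+1} = x_n + (r - x_n)/pi */
        printf("after 25 steps: x = %.15f (target %.15f)\n", x, r);
        return 0;
    }

The gap shrinks by a factor of (1 - 1/pi) per step, so the convergence is geometric.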

Any simple interval in R will be either closed or open on each end (but it could be closed on one and open on the other). It's more illustrative to create a set with a non-compact interior. In higher dimensions it's possible to have more exotic borders on an interval, but I think that border will just end up being isomorphic to a non-compact set in a lower dimension.


The simplest example I can think of is the interval (0, 1].

Proof that it's not closed: the sequence (1, 1/2, 1/4, 1/8, ...) is entirely inside the interval, but converges on 0, which is outside the interval, therefore etc.

Proof that it's not open: the sequence (2, 3/2, 5/4, 9/8, ...) is entirely outside the interval (i.e. inside the complement), but it converges on 1, which is inside the interval (i.e. outside the complement). Thus the complement of the interval is not closed, therefore etc.


[0,1) = {x | 0<=x<1} is probably the simplest example.


Also, it is possible for the opposite to occur. So-called clopen sets are both open and closed.

http://en.m.wikipedia.org/wiki/Clopen_set


The analogy I've always used is that the closed interval has what amounts to a lid or a cap on it, while the open interval does not. Another (sort of related) way of looking at it is that the closed interval has a maximum (or minimum) value, whereas the open interval, despite missing only a single value, suddenly has a feeling of continuation: without that final value it asymptotes toward the end, and you will always be able to find a larger (or smaller) value than any value you have examined so far.


The so-called "french notation" is interesting and — I think — clearer: a closed interval is [a, b] and an open interval is ]a, b[.


I learned math in a French-system school and yes, that's definitely a much better notation in my opinion. The difference between [] and () is not immediately clear, whereas [a,b] versus ]a,b[ makes it obvious that one includes a and b while the other does not. It also makes it easy to remember "open" and "closed" and what they mean in terms of whether the interval bounds are included or excluded.


I don't know if it's very intuitive given that a and b are still inside the ][. Maybe a]..[b would make it clearer?


    a[..]b


just a][b :)


I learned it one way in Middle School and the other in High School. What's perplexing is that the schools were literally next door to each other, and in the same school system too.


Even though this is also the way I think of it, this underscored to me the fact that you can make up plausible explanations for anything; even for words that are opposites and are trying to explain the same thing.


Thanks saurik, I think I understand that way of thinking.

With a closed interval you eventually reach a bound instead of getting ever smaller/larger and continuing.


It's a good question. A close reading of the Wikipedia page https://en.wikipedia.org/wiki/Interval_%28mathematics%29#Ter... tells me that an open interval has no definite maximum or minimum.

This no doubt sounds strange, but it surely comes down to the slightly odd behavior of real numbers. In an open interval like (0, 1), there is no single real number which is less than one but is greater than all other real numbers less than one. (Proof by Cantor's diagonal: write out a list of all such numbers and you can always construct another which wasn't on the list, even when the list is infinite.) In contrast, the closed interval [0, 1] has a definite maximum element, the number 1.

This seems like a good example of poor UX in mathematics; the terms are too similar and can only be learned by memorization, and the notations likewise. Good luck changing it, though; Wikipedia informs me that there are no fewer than two ISO standards codifying it.

There is a perfectly clear notation in the Wikipedia page, however: (a,b) = {x∈ℝ|a<x<b}; [a,b] = {x∈ℝ|a≤x≤b}


> real numbers. In an open interval like (0, 1), there is no single real number which is less than one but is greater than all other real numbers less than one.

The rational numbers are like this too.

All the rational numbers between 0 and 1 also get arbitrarily close to 1 with no largest element. You can prove this by simple contradiction: for every rational number x < 1, there is another rational number y = (x+1)/2 with x < y < 1. So the set of rational numbers less than 1 also has no largest element.

Actually real numbers can be defined (axiomatized) in terms of the least upper bound property -- every set of real numbers that's bounded above has a least upper bound. So you could actually have a set S of rational numbers that gets arbitrarily close to something like sqrt(2) from below. S then has no rational least upper bound -- for every rational number x greater than or equal to every element of S, there is another rational number y greater than or equal to every element of S but smaller than x.

Note that a set of real numbers need not contain its upper bound -- as you noted, a bounded open interval doesn't contain its upper bound.


I don't see how this is poor UX. If we accept your statement that the terms "open" and "closed" are too similar, what do you suggest?

Also, any notation will have to be learned. Concise ones take a bit more learning, but there is nothing one can do about that.

And that perfectly clear notation requires you to learn quite a bit by memorization, such as the meaning of those {} brackets, the |, and the symbols ∈ and ℝ.

Finally, you don't need Cantor's diagonalization for a proof. It is easy(1) to show that, for any real x<1, x+(1-x)/2 is real, less than one, and strictly greater than x.

Alternatively, assuming 0<x<1, write x in decimal, and increase the first digit less than nine by one to get a larger real less than one. There is such a digit because 0.9999999… is not less than one.

(1) depending on how deep you want to descend into the foundations of mathematics.


The way I wrap my head around this is "open" == "no defined beginning or end point" and "closed" == "precisely defined beginning/end point".

To take your thinking a bit further the interval is "shut" precisely because it's closed exactly at that point. And the interval is thrown "open" by not including the point.

BTW, Rudin's classic text [1] covers the fascinating topic of neighborhoods, which builds on the concept of intervals. It took me a while to completely understand the concept, but after that, understanding limits was relatively easy; even otherwise, "neighborhood" as a concept is interesting in itself.

[1] http://www.math.boun.edu.tr/instructors/ozturk/eskiders/guz1...


My word, I'd have to do some studying to get through that!

Thanks for everyone's input. I think I have a clear understanding of the open/closed paradigm now.


Think of Tom&Jerry. Open is when Tom is running on a carpet towards an open door, but the carpet is moving under him and he never reaches the door. Closed is when Tom slams into a closed door and gets flattened.


I always cringe when I see or have to write:

    for ( ; ; )
or:

    while ( true )
etc.

You just know you're setting yourself up for undocumented bugs later.

I've been known to build "escape" vars into these like:

    int attempts = 1000;
    for ( ; attempts > 0; attempts-- ) {
        /* do something; break out on success */
    }

    if (attempts <= 0) {
        /* bail: we ran out of attempts */
    }
Where 1000, or whatever the number is, is a reasonable estimate of how many iterations the function should need, times 10 or so.


Really? I use "while (true)" pretty frequently in my code. In my mind, all loops are composed of these parts:

1. Some initialization (optional).

2. Some stuff you do each iteration (optional).

3. Check the exit condition and exit the loop (optional).

4. Some stuff you do each iteration (optional).

"while" loops nicely handle the case where 2 is empty. "do while" loops handle 4 being empty. "for" loops handle both 2 and 4 being non-empty but only really accommodate 4 being a single expression.

For any other case where 2 is non-empty and 4 is more than a single expression, I just use a "while (true)" with an "if (...) break;" in the body.
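A minimal sketch of that shape in C (a trivial byte-copying loop; the numbered comments match the parts above):

    #include <stdio.h>
    #include <stdbool.h>

    int main(void) {
        while (true) {
            int c = getchar();    /* 2: work that must happen before the test */
            if (c == EOF) break;  /* 3: the exit condition, in the middle */
            putchar(c);           /* 4: work that uses the result */
        }
        return 0;
    }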


Burying the only way out in the loop body is more error-prone, especially for those who come after you. Putting an escape hatch in the loop setup makes it less likely that an operationally-problematic[0] infinite loop scenario will be triggered. This is particularly important when your code doesn't have a runtime environment that can easily kill runaway requests.

[0] Or utterly catastrophic. Think embedded system with limited debugging facilities. This kind of bug could manifest as the system completely locking up, and take much, much longer to track down than an "infinite loop detected!" message in an error log.


> Burying the only way out in the loop body is more error-prone, especially for those who come after you.

I don't prefer to exit from the middle of a loop, but if that's the most succinct way to implement the correct behavior, I'll do it. I think short code that exits from the middle is less error-prone than code that has to be more convoluted to put the exit at the top or bottom.

> Putting an escape hatch in the loop setup makes it less likely that an operationally-problematic[0] infinite loop scenario will be triggered.

An escape hatch is yet more code that has to be tested and debugged. What if the escape hatch triggers too early?

> [0] Or utterly catastrophic. Think embedded system with limited debugging facilities.

Sure, but unusual platforms require unusual coding styles. If I was targeting that I might adjust my practices.


> code that has to be more convoluted to put the exit at the top of bottom

Note I said only way out. The point is that if "the" (intended) exit can't reasonably be in the loop condition, then adding an escape hatch to have a second way out is useful.

Assuming C you can even just compress the whole thing to a for_x_tries(n) macro.
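Something like this, say, as a sketch assuming C99 (for_x_tries is the name suggested above; try_operation is a hypothetical work function):

    #define for_x_tries(n) for (int tries_ = (n); tries_ > 0; tries_--)

    int try_operation(void);  /* hypothetical: returns nonzero on success */

    int attempt_with_retries(void) {
        for_x_tries(1000) {
            if (try_operation())
                return 1;  /* success */
        }
        return 0;          /* exhausted all tries: the escape hatch fired */
    }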

> What if the escape hatch triggers too early?

You'll get a helpful message and can bump the count. Easily-isolated errors are preferable to frozen/DoS'd systems.


I'm not sure why any competent programmer would use for(;;) or while(true) without explicitly including a break; statement.


I wonder what you will do when you see tail-call-optimized functions in Prolog or Lisp.


There's at least one other bug, possibly two, lurking in the implementation.

1) Algorithm FT says:

    1. Generate u. Store the first bit of u
    as a sign s (s=0 if u<1/2, s=1 if u>=1/2).
and yet the C code implements

    if ( u <= 0.5 ) s = 0.0;
    else s = 1.0;
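For reference, a literal transcription of the comparison in step 1 would be (only the boundary case u == 0.5 differs):

    if ( u < 0.5 ) s = 0.0;
    else s = 1.0;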
2) I can't be sure of the following w/o access to doc. But i4_uni() says

    a uniform distribution over (1, 2147483562)
which, offhand, is suspicious. A distribution over positive integers would probably want to use all available values in a 32-bit signed int, so it would most likely end at 2^31 - 1, which is 2147483647, and not the value given.


Good point about the C code. OTOH, 2147483562 is 1 less than a prime, which is unlikely to be an accident.


It would be safest for most rand() functions to omit both zero and one, unless a user was really sure they wanted otherwise. If we were generating real numbers, we'd never see precisely zero or one; the fact that we do is an artifact of limited precision. These boundary cases cause problems in common computations like u*log(u) or (1-u)*log(1-u).
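A minimal illustration of that failure mode in C (both lines print nan, because 0 * -inf is NaN in IEEE arithmetic, even though the mathematical limit is 0 in both cases):

    #include <stdio.h>
    #include <math.h>

    int main(void) {
        double u = 0.0;
        printf("%f\n", u * log(u));                /* 0 * log(0) = 0 * -inf */
        u = 1.0;
        printf("%f\n", (1.0 - u) * log(1.0 - u));  /* same problem at u = 1 */
        return 0;
    }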


Interesting suggestion, but I'm not sure I agree with you.

You could use that argument about any of the random numbers that your rng returns. "Hey, 0.023 is infinitesimally unlikely to occur, so let's exclude that as well".

Generally when I use a real-valued rng, the numbers generated are meant to be 'representative' of what I should get from the distribution. And when I get e.g. 0.023, it kind-of means "0.023 and/or numbers near to 0.023". If I excluded 0.023 from the possible results of the rng, then I would have a 'hole' in my distribution 'in the region of' 0.023.

Maybe what you want is to go from the possible rnds:

0.000, 0.001, 0.002, 0.003, ... , 0.999

to:

0.0005, 0.0015, 0.0025, ... , 0.9995

whereas with your suggestion, you are under-representing the boundary numbers near to 0 and/or near to 1. Generally not a problem I guess, but what you are doing is, imo, 'wrong in theory', even though your high-precision floats will probably cover it up ok in almost all cases.
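That half-step offset is a one-line change to the usual integer-to-float recipe. A sketch in C (uniform_open is a made-up name, and rand() just stands in for whatever integer generator you actually use):

    #include <stdlib.h>

    /* Map {0, 1, ..., RAND_MAX} onto the midpoints {0.5/N, 1.5/N, ...}
       with N = RAND_MAX + 1, so the result is strictly inside (0,1). */
    double uniform_open(void) {
        return ((double) rand() + 0.5) / ((double) RAND_MAX + 1.0);
    }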

I agree with you about 0 and 1 causing problems when they are fed into other functions - e.g. generating Gaussian random numbers via InverseCumulativeGaussian(0.0) - a bug I wasted some time hunting down in my company's rng library. My view was that the developer who wrote the library did not understand the maths of what he was doing, rather than that the 0-to-1 random number generator was at fault.


It's pretty common in practice to want [0,1) (that is, 0 <= x < 1) from your random number generator. Never generating 0 would be a problem when the range is small and discrete - generating random letters for example.


If you're generating integers, or selecting from a discrete set with uniform probabilities, there's no reason to involve floating point at any point in the process.
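For instance, a sketch of an unbiased pick from n discrete outcomes using integer arithmetic only (pick_index is a hypothetical helper; the rejection loop discards the uneven tail of rand()'s range that would otherwise bias r % n):

    #include <stdlib.h>

    int pick_index(int n) {
        int limit = RAND_MAX - (RAND_MAX % n);  /* accepted count is a multiple of n */
        int r;
        do { r = rand(); } while (r >= limit);
        return r % n;  /* each of the n outcomes is now equally likely */
    }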


Generating random numbers in (0,1) is mathematically different than generating random numbers, even from the same type of distribution, in [0,1), (0,1], or [0,1]. Equating all of these, as the article points out, rarely causes problems in practice, but it is nonetheless a conceptual error.

If you'll pardon me pulling out my soapbox for a moment, this kind of bug would be less common if the "programmers don't need math" attitude were not such a successful meme in our culture. The notion of "between zero and one" may seem straightforward and obvious, but is actually ambiguous. Perhaps, given the number of distinct floating-point values that can be represented in the interval (0,1), it would hardly seem to matter which interpretation of "between zero and one" is used. But in mathematics you will never see such terminology used without a qualifying statement that clarifies* the meaning.

*E.g. "between 0 and 1, inclusive", "on the closed interval bounded by 0 and 1", "[0,1]", etc.


(Also posted 20 hours ago, also no comments.) (It occurs to me that if there are ever comments on this post, my comment will sound really confusing: let it be clear that this article is #1 currently, was posted 40 minutes ago, and there are no comments yet. ;P)

https://news.ycombinator.com/item?id=8453042


Well, this article is so thorough, detailed and well-written, that I don't know what to say other than "Wow, that was cool and very interesting!".

More stories like that?


I especially like the clever analogy with genetics. Douglas Hofstadter said something along the lines of analogy being more powerful when the two concepts it connects are otherwise very distant from each other.


Agreed; technical articles such as this can't come up often enough for my tastes.

I've never used ranlib; I hope the documentation can be cleaned up at least. On the other hand, anyone who has read A Deepness in the Sky will surely feel that it is unlikely in the long term.


Thumbs up for not using rand, but assuming that MT is a silver bullet is not exactly scientific either; one should test a few RNGs, since it may turn out that the code exploits some ultra-hidden flaw in MT, or that some much faster RNG works equally well. Also, reproducibility of a stochastic code means that the code's results lead to the same conclusions regardless of the seed, not that you get bit-for-bit identical output for the same seed. Assuming only the latter can end in seed cherry-picking, refusing to optimize code because it would "break reproducibility", or never investigating the natural deviation of the results.


A great exposition, and worth it just for the introduction to "schrödinbug". http://en.wikipedia.org/wiki/Heisenbug#Related_terms


Really interesting read.

I wonder about the resolution: was snorm() reimplemented correctly, or was an RNG with the 'wrong' interval supplied?





