Improving the fast inverse square root (2010) (wz.cz)
155 points by cpdt on July 9, 2018 | 55 comments



Pretty much off topic, but Řrřola, the author of this blog post, also makes mind-blowing 256-byte demos.

E.g. Puls from 2009: https://www.pouet.net/prod.php?which=53816 (check the YouTube link if you don't have an MS-DOS setup ready)

I understand little about extreme sizecoding, but I suspect there's a similarly obsessive, mathy story behind it as in this blog post: reusing the same bytes as both code and content in a way that actually works and looks great.


From the author:

What is it: implicit surfaces raymarching using binary search. The shapes get "blown up" according to step size, which fakes the ambient occlusion feel (also necessary for the bisection to work). Color is the number of missed probes minus log(last step size), which had the most bearable artifacts.

- implicit surfaces are surfaces defined as the solution of an equation f(x,y,z)=0

- raymarching is a raytracing technique where you advance step by step along the ray; it is a very common technique for sizecoding (a rough sketch is below). The rest of the description details the rendering tricks used for shading and coloring.
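
To make the "binary search raymarching" idea concrete, here is a minimal sketch in C of the general approach (not Řrřola's actual code, and without the blow-up/shading tricks): march the ray in fixed steps until the implicit function changes sign, then bisect the last interval to refine the hit point.

  typedef struct { float x, y, z; } vec3;

  /* Example implicit surface f(x,y,z) = 0: a unit sphere. Negative inside. */
  static float f(vec3 p) { return p.x*p.x + p.y*p.y + p.z*p.z - 1.0f; }

  static vec3 at(vec3 o, vec3 d, float t) {
      vec3 p = { o.x + t*d.x, o.y + t*d.y, o.z + t*d.z };
      return p;
  }

  /* March from origin o along direction d; return hit distance or -1. */
  float march(vec3 o, vec3 d) {
      float step = 0.05f;
      for (float t = step; t < 16.0f; t += step) {
          if (f(at(o, d, t)) < 0.0f) {         /* sign change: crossed the surface */
              float lo = t - step, hi = t;
              for (int j = 0; j < 16; j++) {   /* binary search on the last interval */
                  float mid = 0.5f * (lo + hi);
                  if (f(at(o, d, mid)) < 0.0f) hi = mid; else lo = mid;
              }
              return 0.5f * (lo + hi);
          }
      }
      return -1.0f;                            /* missed */
  }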

The "content" does not "use the same byte as code", it is code, in the form of the implicit surface equation.


I heard some Nintendo games do that with sprites or sounds that can take on a random-ish look. Very, very cool.


Yars' Revenge on the Atari 2600 used the game code as random input to generate the graphics for the 'safe zone'.


It is worth noting that with AVX-512, Intel has introduced a native inverse sqrt approximation (VRSQRT14).


An inverse sqrt approximation has been available since SSE1 via the rsqrtss and rsqrtps instructions.


Which is nice because SSE1 and SSE2 are mandatory parts of x86_64. If you're writing a 64-bit desktop application, you can use rsqrtss without any checks or fallbacks.

Unfortunately, it doesn't tend to get used automatically in languages like C. The result of rsqrtss is slightly different from 1/sqrtf(x) computed as two separate operations, so the compiler cannot apply it as an optimization.

If the rules for floating-point optimization are loosened by passing -ffast-math to GCC, the compiler will use it. That being said, -ffast-math is a shotgun that affects a lot of things. If you need signed zeros, Infs, NaNs or denormals, that flag may break your program.
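
For example, with a function like this (illustrative), the compiler's choice depends entirely on the flags:

  #include <math.h>

  /* gcc -O2 -ffast-math: may compile to rsqrtss + a Newton-Raphson step.
     gcc -O2 (default):   full-precision sqrt followed by a division. */
  float inv_sqrt(float x) {
      return 1.0f / sqrtf(x);
  }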


> -ffast-math is a shotgun that affects a lot of things

Interesting point. GCC and MSVC both seem to have (incompatible) intrinsic functions, for what that's worth.

https://gcc.gnu.org/onlinedocs/gcc-4.8.5/gcc/X86-Built-in-Fu...

https://docs.microsoft.com/en-us/previous-versions/visualstu...


That is not quite true. You don't need to use GCC __builtin functions for this. GCC supports SSE1 intrinsics like _mm_rsqrt_ss exactly the same as MSVC does - they are declared in the xmmintrin.h header. Just include it and _mm_rsqrt_ss/_mm_rsqrt_ps will be available to you in both GCC and MSVC.
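
A minimal sketch of what that looks like (should build the same way in GCC, Clang, and MSVC):

  #include <xmmintrin.h>  /* SSE1 intrinsics */

  /* Hardware reciprocal square root estimate, accurate to roughly 12 bits. */
  float approx_rsqrt(float x) {
      return _mm_cvtss_f32(_mm_rsqrt_ss(_mm_set_ss(x)));
  }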


I find it quite fortunate that they don't use it automatically. Introducing a 1e-3 relative error is quite a deal breaker for some. Not for games, sure, but for science that is mostly unacceptable.


From memory, GCC does one Newton-Raphson iteration on the approximate result, so the error is much lower (closer to 1e-9, from memory again). They don't use the approximation directly in fast-math mode.
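
The refinement in question is the standard Newton-Raphson update for 1/sqrt(x). A sketch of the combined sequence (using SSE intrinsics; not GCC's exact generated code):

  #include <xmmintrin.h>

  /* rsqrtss estimate (~12 bits), then one Newton-Raphson step:
     y' = y * (1.5 - 0.5 * x * y * y), which roughly doubles the
     number of accurate bits. */
  float rsqrt_refined(float x) {
      float y = _mm_cvtss_f32(_mm_rsqrt_ss(_mm_set_ss(x)));
      return y * (1.5f - 0.5f * x * y * y);
  }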


This is wildly off topic, but can anyone from either the scientific or the graphics community comment on the practical impact of losing denormals? I certainly understand that they soften the impact of underflow, but does anyone care?


Indeed.

Both reciprocal (inverse) square root SSE SIMD instructions were available in the Intel Pentium III, released in 1999.


Ah, I should have done more googling. I guess the AVX-512 ones are marginally more accurate?


There's also VRSQRT28, which has a maximum relative error of 2^-28.

https://software.intel.com/en-us/articles/reference-implemen...


Thanks, I came here to ask a similar question about native optimizations for this 'hack'. Apologies for my lack of knowledge, but I'm a little confused about which of all these variants to use when compiling C++ code on a 64-bit platform for an inverse square root on the standard 'float' type. Are there varying levels of compromise between speed and accuracy among all these methods? Thanks ...


For a generic 64-bit platform, use RSQRTSS/RSQRTPS, since they're the only ones guaranteed to exist. The others are specific to rather new hardware.

My recollection is that it's accurate to 11.5 bits, so after one refinement step you have nearly full precision (an error bound of a couple ULP). Check Intel's docs for more details.


Thanks!


Note that VRSQRT28 is in AVX-512ER, which is Xeon Phi only.


How does that perform in comparison?


rsqrt{p,s}s has a guaranteed relative error <= 1.5 * 2^-12, or about 3.6e-4. According to Agner Fog, it typically executes with single-cycle throughput. I would assume that the AVX-512 versions are similar.


How is it that "inverse" seems to be used as "multiplicative inverse" in this context? It seems like a really ambiguous term, because it could also be interpreted as either:

- the inverse of the square root (which is just the squaring operation), or

- the inverse of some other binary operator, like addition or anything else...


I think you’ve hit the nail on the head:

> it could also be interpreted as ... [the] inverse of the square root (which is just the squaring operation)

Since the other obvious interpretation is not very useful and has a clearer name—i.e. “the square”—the term “inverse square root” has only one useful meaning, which is therefore how it’s interpreted. (I don’t follow the second option about binary operators.) Mathematical terminology and notation in general are full of ambiguities which are resolved by extensive contextual knowledge. As noted by a sibling comment, calling it the reciprocal square root would be clearer.


A better phrase would be "reciprocal square root".


Yes, "inverse" could perhaps be more clearly stated as "inverted". I've heard the term used this way before, but it's not common.


It is the inverse of the square root. If you want to normalise a vector, you divide the components by the length. The length is the sqrt of the sum of squares (Pythagoras). Division is more expensive than multiplication. So get the inverse sqrt of the sum of the squared components, then multiply the components by it.
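
In code, the trick described here is just this (a sketch, with plain sqrtf standing in for whichever fast inverse sqrt you use):

  #include <math.h>

  /* Normalise a 3-vector with one reciprocal sqrt and three multiplies,
     instead of three (more expensive) divides. */
  void normalize3(float v[3]) {
      float inv_len = 1.0f / sqrtf(v[0]*v[0] + v[1]*v[1] + v[2]*v[2]);
      v[0] *= inv_len;
      v[1] *= inv_len;
      v[2] *= inv_len;
  }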


The point is that "inverse" usually refers to the function that reverses the effect of the original function, i.e.

  f_inv(f(x)) = x   for all x in Domain(f)

g(x) = 1/sqrt(x) is not the inverse of f(x) = sqrt(x) in this sense.


It seems like this does not work for denormal floats.


That's great and all, but nobody needs a 32-bit anything in 2018. This undergraduate paper provides a magic number and associated error bound for 64-bit doubles:

https://cs.uwaterloo.ca/~m32rober/rsqrt.pdf
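
For reference, the 64-bit version looks like this (a sketch; the magic constant 0x5FE6EB50C7B537A9 is the one given in the linked thesis, and memcpy avoids the undefined behavior of pointer type-punning):

  #include <stdint.h>
  #include <string.h>

  double rsqrt64(double x) {
      uint64_t i;
      double y;
      memcpy(&i, &x, sizeof i);
      i = 0x5FE6EB50C7B537A9ULL - (i >> 1);   /* 64-bit magic number */
      memcpy(&y, &i, sizeof y);
      return y * (1.5 - 0.5 * x * y * y);     /* one Newton-Raphson step */
  }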


That's not really accurate. Even in cases where 32-bit and 64-bit operations are equally fast on the CPU, 32-bit values still take up half the memory. For many workloads, the limiting factor is cache space. So, if you can use 32-bit values, you can get much better performance for those workloads.


And if you're doing heavy floating-point work, you can fit twice as many operations into a 32-bit float vector as into an equally sized double vector, and the vectorized operations happen roughly as fast for both forms, yielding an approximate doubling of speed.


For rank-2 tensor work you can do 4x as many operations; for rank-3 tensor work, it's 8x, assuming memory bandwidth is the bottleneck.


Does that mean it’s 64x as fast for 16-bit floating point vs 64-bit for a rank 3 tensor?


Assuming 1) memory bandwidth is the bottleneck and 2) you can keep the tensor values in cache or registers.

I think that GPUs are still vector processing engines, so they should scale with 4x... But assuming Google architected the TPU correctly, it should be 16x as fast (I think the architecture is actually that of a rank-2 tensor).


This "nobody needs a 32-bit anything in 2018" seems like a weird opposite of "640K should be enough for anyone".

https://www.wired.com/1997/01/did-gates-really-say-640k-is-e...


Lots of ML and AI applications are using ever-smaller precisions. Half and even quarter-precision floats are able to maximize efficiency of the various CPU/GPU ALUs.


I was going to mention that... Just because we have ridiculous transistor budgets doesn't mean there aren't problems where you need/want to push the envelope for performance instead of precision. If anything, it grows the applicable problem space.


> That's great and all, but nobody needs a 32-bit anything in 2018.

Then why do x86-64 integer instructions default to 32-bit operand size when the REX prefix byte is not present?

You can double x86 FP throughput by using 32-bit floats versus 64-bit ones.

For GPUs, the 32-bit float performance advantage can be 4-10x (sometimes a lot more).


TIL, no one in the gaming industry uses 32 bit floats any longer. /s


Funny, in 2018 a lot of people are asking for 16-bit floats.

https://en.wikipedia.org/wiki/Half-precision_floating-point_...


This is not true. In games 32-bit floats are extremely common.


Nobody needs absolutes in 2018.


Even scientific calculation would be fine with 32-bit floats, but the average floating-point error due to representation creeps up with O(N) (IIRC) over N multiplications, so you have to use 64-bit for many scientific applications to get satisfactory results after a million or a trillion multiplications.
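
A quick way to see the creep (a toy sketch): run the same chain of multiplications in float and double and compare. The constant is cast so both loops multiply by exactly the same value, so the difference is purely accumulated rounding error.

  #include <stdio.h>

  int main(void) {
      float  pf = 1.0f;
      double pd = 1.0;
      float  c  = 1.0000001f;   /* same representable constant for both */
      for (long i = 0; i < 10000000; i++) {
          pf *= c;
          pd *= (double)c;
      }
      printf("float: %.7g  double: %.15g  rel err: %g\n",
             pf, pd, (pd - pf) / pd);
      return 0;
  }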


Not really - https://en.wikipedia.org/wiki/Numerical_stability

If your algorithm is not stable then even 64-bit won't help you.

Compare Euler vs Verlet - https://en.wikipedia.org/wiki/Verlet_integration


You're making a different argument.


Which problem that has a stable algorithm would require 64-bit, then?


What they typically do in 3D gaming is update the matrix that holds the transformation by a left multiplication every time the camera changes. So

  T_n = U_{n-1} * U_{n-2} * ... * U_0 * T_0

After a while, your matrix accumulates errors, but it's easy to just start over and take a fresh one.
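
A toy sketch of the pattern (hypothetical code, 2D for brevity): left-multiply incremental rotations onto the current transform, and periodically rebuild it from the tracked parameters to throw away the accumulated error.

  #include <math.h>

  typedef struct { float m[2][2]; } mat2;

  static mat2 rotation(float a) {
      mat2 r = {{{ cosf(a), -sinf(a) },
                 { sinf(a),  cosf(a) }}};
      return r;
  }

  static mat2 mul(mat2 a, mat2 b) {
      mat2 c;
      for (int i = 0; i < 2; i++)
          for (int j = 0; j < 2; j++)
              c.m[i][j] = a.m[i][0]*b.m[0][j] + a.m[i][1]*b.m[1][j];
      return c;
  }

  void spin(void) {
      mat2 t = rotation(0.0f);
      float total = 0.0f, step = 0.01f;
      for (int n = 1; n <= 1000000; n++) {
          t = mul(rotation(step), t);   /* T_n = U * T_{n-1} */
          total += step;
          if (n % 1024 == 0)
              t = rotation(total);      /* "take a fresh one" */
      }
  }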


> Even scientific calculation would be fine with 32 bit floats

It really depends on the algorithms in question and the error tolerances.


Deep learning uses low-precision floats.

Sometimes as few as 8 bits are needed.


I think gen 1 or gen 2 of the TPU explicitly supported short ints.


Realtime 3D still uses floats, but only when we can afford something so big; s10e5 (half precision: 1 sign bit, 10 mantissa bits, 5 exponent bits) is better where available.


I think this article is from 2010.


I wanted to put over a billion floats in a numpy array just a few months ago. Making them 16-bit saved a lot of memory.

It doesn't matter how much resource limits increase, people are going to keep hitting them. And when they hit them, using a smaller data type will always help.


That's not relevant; there are plenty of single-precision float applications today (and many fixed-point applications as well). It all depends on your workload.


> nobody needs a 32-bit anything in 2018

Tell us more about this strange "2018" place!



