I understand little about extreme sizecoding, but I suspect it's a similarly obsessive, math-heavy story to this blog post: making double use of the same bytes as code and content in a way that things actually work and look great.
What is it: implicit surface raymarching using binary search. The shapes get "blown up" according to the step size, which fakes the ambient occlusion feel (and is also necessary for the bisection to work). Color is the number of missed probes minus log(last step size), which gave the most bearable artifacts.
- implicit surfaces are surfaces defined as the solution set of an equation f(x,y,z) = 0
- raymarching is a raytracing technique where you advance step by step along the ray; it is a very common technique in sizecoding (a minimal sketch follows below). The rest of the description details the rendering tricks used for shading and coloring.
The "content" does not "use the same byte as code", it is code, in the form of the implicit surface equation.
Which is nice, because SSE1 and SSE2 are mandatory parts of x86_64. If you're writing a 64-bit desktop application, you can use rsqrtss without any checks or fallbacks.
Unfortunately, it doesn't tend to get used automatically in languages like C. The result of rsqrtss is slightly different from 1/sqrtf(x) computed as two separate operations, so it cannot be applied as an optimization.
If the rules for floating point optimization are loosened by passing -ffast-math to GCC, the compiler will use it. That being said, -ffast-math is a shotgun that affects a lot of things. If you need signed zeros, Infs, NaNs or denormals, that flag may break your program.
That is not quite true. You don't need to use GCC __builtin functions for this. GCC supports SSE1 intrinsics like _mm_rsqrt_ss exactly the same as MSVC does; it is declared in the xmmintrin.h header. Just include it and _mm_rsqrt_ss/_mm_rsqrt_ps will be available to you in both GCC and MSVC.
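A minimal sketch of that route, assuming an x86-64 target where SSE1 is guaranteed (the wrapper function name is mine):

```c
#include <xmmintrin.h>   /* SSE1 intrinsics: available in GCC, Clang and MSVC */

/* Approximate 1/sqrt(x) with the RSQRTSS instruction.
   Relative error is bounded by 1.5 * 2^-12 (~3.6e-4), see below. */
static inline float rsqrt_approx(float x) {
    return _mm_cvtss_f32(_mm_rsqrt_ss(_mm_set_ss(x)));
}
```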
I find it quite fortunate that they don't use it automatically. Introducing a 1e-3 relative error is quite a deal breaker for some. Not for games, sure, but for science that is mostly unacceptable.
From memory, GCC does one Newton-Raphson iteration on the approximate result, so the error is much lower (closer to 1e-9, again from memory). They don't use the approximation directly in fast-math mode.
This is wildly off topic, but can anyone from either the scientific or the graphics community comment on the practical impact of losing denormals? I certainly understand that it softens the impact of underflow, but does anyone care?
Thanks, I came here to ask a similar question about native optimizations on this 'hack'. Apologies for my lack of knowledge, but I'm a little confused about which of these variants to use when compiling C++ code on a 64-bit platform for an inverse square root of the standard 'float' type. Are there varying levels of compromise between speed and accuracy among all these methods? Thanks ...
For a generic 64-bit platform, use RSQRTSS/RSQRTPS, since they're the only ones guaranteed to exist. The others are specific to rather new hardware.
My recollection is that it's accurate to 11.5 bits, so after one refinement step you have nearly full precision (an error bound of a couple ULP). Check Intel's docs for more details.
rsqrt{p,s}s has a guaranteed relative error <= 1.5 * 2^-12, or about 3.6e-4. According to Agner Fog, it typically executes with a throughput of one per cycle. I would assume that the AVX512 versions are similar.
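For reference, the refinement step being discussed is one Newton-Raphson iteration on y ≈ 1/sqrt(x); a hedged sketch (my own, not necessarily what GCC emits):

```c
#include <xmmintrin.h>

/* RSQRTSS estimate plus one Newton-Raphson step:
   y' = y * (1.5 - 0.5 * x * y * y).
   This roughly squares the relative error (~3.6e-4 down to roughly 1e-7),
   i.e. within a few ULP of the exact single-precision result. */
static inline float rsqrt_refined(float x) {
    float y = _mm_cvtss_f32(_mm_rsqrt_ss(_mm_set_ss(x)));
    return y * (1.5f - 0.5f * x * y * y);
}
```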
How is it that inverse seems to be used as "multiplicative inverse" in this context? It seems like a really ambiguous term, because it could also be interpreted as either:
inverse of the square root (which is just the squaring operation), or
the inverse of some other binary operator, like addition or anything else...
> it could also be interpreted as ... [the] inverse of the square root (which is just the squaring operation)
Since the other obvious interpretation is not very useful and has a clearer name—i.e. “the square”—the term “inverse square root” has only one useful meaning, which is therefore how it’s interpreted. (I don’t follow the second option about binary operators.) Mathematical terminology and notation in general are full of ambiguities which are resolved by extensive contextual knowledge. As noted by a sibling comment, calling it the reciprocal square root would be clearer.
It is the inverse of the square root. If you want to normalise a vector, you divide the components by the length. The length is the sqrt of the sum of the squares (Pythagoras). A divide is more expensive than a multiply. So get the inverse sqrt of the sum of the squared components, then multiply the components by that inverse sqrt.
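As a concrete sketch of that pattern (hypothetical helper, plain C):

```c
#include <math.h>

/* Normalise a 3-vector in place: one inverse square root and three multiplies
   instead of one square root and three divides. The 1.0f / sqrtf(...) here is
   exactly the expression the rsqrt tricks above approximate. */
static void normalize3(float v[3]) {
    float inv_len = 1.0f / sqrtf(v[0]*v[0] + v[1]*v[1] + v[2]*v[2]);
    v[0] *= inv_len;
    v[1] *= inv_len;
    v[2] *= inv_len;
}
```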
That's great and all, but nobody needs a 32-bit anything in 2018. This undergraduate paper provides a magic number and associated error bound for 64-bit doubles:
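For the curious, the 64-bit variant of the classic bit trick is usually written roughly like this; the constant is the one commonly quoted for doubles in that literature, and I'm taking it (and its error bound) on faith rather than re-deriving it:

```c
#include <stdint.h>
#include <string.h>

/* Quake-style fast inverse square root, double-precision flavour.
   0x5FE6EB50C7B537A9 is the magic constant commonly cited for 64-bit doubles
   (the classic float version uses 0x5F3759DF). One Newton-Raphson step follows. */
static double fast_rsqrt64(double x) {
    double half = 0.5 * x;
    uint64_t i;
    memcpy(&i, &x, sizeof i);            /* reinterpret bits without UB */
    i = 0x5FE6EB50C7B537A9ULL - (i >> 1);
    memcpy(&x, &i, sizeof x);
    return x * (1.5 - half * x * x);     /* one refinement iteration */
}
```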
That's not really accurate. Even in cases where 32-bit and 64-bit operations are equally fast on the CPU, 32-bit values still take up half the memory. For many workloads, the limiting factor is cache space. So, if you can use 32-bit values, you can get much better performance for those workloads.
And if you're doing heavy floating point work, you can fit twice as many elements into a 32-bit float vector as into an equally sized double vector, and the vectorized operations run roughly as fast for both forms, yielding an approximate doubling of speed.
assuming 1) memory bandwidth is the bottleneck and 2) you can keep the tensor values in cache or registers.
I think that GPUs are still vector processing engines, so they should scale by 4x... But assuming Google architected the TPU correctly, it should be 16x as fast (I think the architecture is actually that of a rank-2 tensor).
Lots of ML and AI applications are using ever-smaller precisions. Half and even quarter-precision floats are able to maximize efficiency of the various CPU/GPU ALUs.
I was going to mention that... Just because we have ridiculous transistor budgets doesn't mean there aren't problems where you need/want to push the envelope for performance instead of precision. If anything, it grows the applicable problem space.
Even scientific calculations would be fine with 32-bit floats, but the average floating point error due to representation creeps up as O(N) (iirc) over N multiplications, so you have to use 64-bit for many scientific applications to get satisfactory results after a million or a trillion multiplications.
I wanted to put over a billion floats in a numpy array just a few months ago. Making them 16-bit saved a lot of memory.
It doesn't matter how much resource limits increase, people are going to keep hitting them. And when they hit them, using a smaller data type will always help.
That's not relevant; there are plenty of single-precision float applications today (and many fixed-point applications as well). It all depends on your workload.
E.g. Puls from 2009: https://www.pouet.net/prod.php?which=53816 (check the YouTube link if you don't have an MS-DOS setup ready)