Worth noting that Kahan summation works under the assumption that, at every step of the summation, the current sum is larger in magnitude than the term being added.
It is not too hard to modify the algorithm to remove this assumption, and doing so improves its accuracy noticeably (though sadly a magnitude test, as suggested on the Wikipedia page, is not the best way to do it).
Here is a Rust crate (not mine) that offers various implementations of increasing accuracy, plus an exact sum (slow, but it returns the floating-point value closest to the true sum of your inputs): https://crates.io/crates/accurate
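For illustration, one branch-free way to drop that assumption (a sketch of the general technique, not the crate's actual code) is Knuth's TwoSum error-free transformation, which recovers the exact rounding error of each addition without comparing magnitudes:

    // Knuth's TwoSum: returns (s, e) with s = fl(a + b) and a + b == s + e exactly
    // (in round-to-nearest), no branch on the relative magnitudes of a and b.
    fn two_sum(a: f64, b: f64) -> (f64, f64) {
        let s = a + b;
        let bb = s - a;
        let e = (a - (s - bb)) + (b - bb);
        (s, e)
    }

    // Compensated sum that does not assume the running sum dominates each term.
    fn compensated_sum(xs: &[f64]) -> f64 {
        let (mut sum, mut comp) = (0.0f64, 0.0f64);
        for &x in xs {
            let (s, e) = two_sum(sum, x);
            sum = s;
            comp += e; // accumulate the rounding errors separately
        }
        sum + comp
    }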
I've used it twice: once when inverting large matrices, where the built-in dot product was too inaccurate, and once when working with a very small process noise in Kalman filters, with 32-bit floats due to platform restrictions, where the process noise could only increase the uncertainty up to a point beyond which it underflowed.
Both times it worked brilliantly, allowing a targeted increase in accuracy on only the components that needed it and letting me choose the memory/computational hit I was willing to take, much like using extra bits in certain temporaries when working in fixed point.
There's not much point to this algorithm now. It is faster to produce the correctly rounded result of the exact sum (at least if you are summing more than a few terms) than to use Kahan summation to produce a still-not-entirely-accurate result. The Wikipedia page mentions the Python fsum function that does this, but my method at https://gitlab.com/radfordneal/xsum is faster.
This is a nice result for the single-scalar case, but it is not immediately obvious to me how it would scale to support SIMD, given its use of per-element indexing and a periodic conditional fallback to a slow path, which does not trigger at the same time for all lanes. Extending Kahan summation (or similar techniques) to take advantage of SIMD, however, is completely straightforward. In the age of AVX-512, the speedups from SIMD are hard to ignore even for 64-bit precision arithmetic.
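For example, a rough sketch of the lane-parallel formulation in Rust (the lane count, chunking, and final reduction are just illustrative; this relies on the compiler or std::simd rather than explicit AVX-512 intrinsics):

    // Lane-parallel Kahan: keep LANES independent (sum, compensation) pairs so the
    // inner loop has no cross-lane dependency; the lanes can then be vectorized
    // (by hand, or sometimes automatically) and combined at the end.
    const LANES: usize = 8;

    fn kahan_sum_lanes(xs: &[f64]) -> f64 {
        let mut sum = [0.0f64; LANES];
        let mut c = [0.0f64; LANES];

        let chunks = xs.chunks_exact(LANES);
        let tail = chunks.remainder();

        for chunk in chunks {
            for i in 0..LANES {
                // classic Kahan update, done independently per lane
                let y = chunk[i] - c[i];
                let t = sum[i] + y;
                c[i] = (t - sum[i]) - y;
                sum[i] = t;
            }
        }

        // fold the lane totals and the leftover elements with scalar Kahan
        let (mut total, mut tc) = (0.0f64, 0.0f64);
        for &x in sum.iter().chain(tail.iter()) {
            let y = x - tc;
            let t = total + y;
            tc = (t - total) - y;
            total = t;
        }
        total
    }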
This is quite interesting, and I'll be examining your method more closely in the future. A skim of the paper convinces me that there is merit.
The main upside to the Kahan method is that it can be incremental and online. Imagine that one is writing a Prometheus-like metrics client, and keeping a running tally. One cannot reorder the summation, and one cannot take advantage of parallelism. In these cases, a small Kahan accumulator can perform incredibly well.
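A minimal sketch of what such a running tally might look like (the struct and method names are made up for illustration, not taken from any particular metrics client):

    /// Running Kahan tally: O(1) state, one value at a time, no reordering needed.
    struct KahanTally {
        sum: f64,
        c: f64, // running compensation for lost low-order bits
    }

    impl KahanTally {
        fn new() -> Self {
            KahanTally { sum: 0.0, c: 0.0 }
        }

        fn add(&mut self, x: f64) {
            let y = x - self.c;
            let t = self.sum + y;
            self.c = (t - self.sum) - y;
            self.sum = t;
        }

        fn value(&self) -> f64 {
            self.sum
        }
    }

Each observation costs a handful of flops and two words of state, which is why it fits the running-tally case so well.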
> The main upside to the Kahan method is that it can be incremental and online.
I got excited when I heard the claims of a method better than Kahan/Neumaier summation, but the storage requirements are enormous (6700% for the "small" version, versus 200% for Kahan), so I definitely wouldn't call Kahan summation "pointless". It's in fact one of the most amazing and underutilised bits of CS out there, IMO.
I'm surprised to see this author randomly appear! Radford Neal was my professor at the University of Toronto for a course on information theory, CSC310H.
I like that there's an accumulator struct. fsum is nice, but it's ugly to stash everything into a list; it's often preferable to push terms into an accumulator and get the sum at the end.
I remember reading about Kahan summation, and I naively thought it would help improve the accuracy/consistency of an n-body code I was using for my PhD. I was trying to resolve a discrepancy between summations performed serially versus in parallel using OpenMP. In the end I abandoned the approach, since it seemed the issue lay elsewhere.
I've employed Kahan summation in a different physics simulation that was giving parallel-versus-serial discrepancies due to global sums being evaluated in different orders in parallel. It did help, but I ultimately concluded that it isn't a complete solution to this issue... in short, I think it just pushes the problem several orders of magnitude down.
For what it's worth, and in the event that your issue was indeed the differing order of summation: there is a solution that /does/ work, which is to cast the numbers to a large (e.g., 256- or 512-bit) fixed-point representation and leverage the reproducibility of integer sums.
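A toy sketch of the idea (the width and scale here are far smaller than the 256-512 bits you'd actually want; it's only meant to show why the result becomes order-independent):

    // Map each double onto a shared fixed-point grid and sum with integers, which
    // is exact and associative (assuming no i128 overflow), hence order-independent.
    // i128 with 32 fractional bits has a deliberately limited dynamic range:
    // bits below 2^-32 are truncated and large values saturate, so a real
    // implementation would use a much wider accumulator.
    const FRAC_BITS: u32 = 32;

    fn to_fixed(x: f64) -> i128 {
        // exact scaling by a power of two, then truncation of the low bits
        (x * (1u64 << FRAC_BITS) as f64) as i128
    }

    fn reproducible_sum(xs: &[f64]) -> f64 {
        let total: i128 = xs.iter().map(|&x| to_fixed(x)).sum();
        total as f64 / (1u64 << FRAC_BITS) as f64
    }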
The issue here is that unless you check that your values don't overflow or underflow at every stage of the calculation, you can never actually know whether your answers are accurate, even if they are repeatable.
I would think that in most cases you should be able to prove an upper bound on the sum and make sure the accumulator is large enough to accommodate it. My guess is that 512 bits should be enough to span all physically/computationally relevant scales.
Note that digits of precision are expressed in decimal unless explicitly given as “bits”, so that the phrase “10,000-digit arithmetic” really means around 33,000 bits.