
>The author of this blog post might as well have used Clenshaw’s algorithm

So is that essentially sort of like Horner's rule but for Chebyshev polynomials? E.g. instead of a0 + (a1 * x) + (a2 * x * x) + (a3 * x * x * x) + ..., use a0 + x * (a1 + x * (a2 + x * (a3 + ...))), so the coefficients are consumed in reverse order and, instead of maintaining both a running sum and a running power x^n, we keep a single accumulator. In Clenshaw's algorithm there are two accumulators, bk1 and bk2, but the structure looks similar.
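For concreteness, a minimal C sketch of what I mean (my own code, so the details may be off):

  /* Horner: a[0] + a[1]*x + ... + a[n]*x^n with one accumulator. */
  double horner(const double *a, int n, double x) {
    double acc = a[n];
    for (int k = n - 1; k >= 0; k--)
      acc = acc * x + a[k];
    return acc;
  }

  /* Clenshaw for a Chebyshev series a[0]*T_0(x) + ... + a[n]*T_n(x),
     using the recurrence T_{k+1}(x) = 2x*T_k(x) - T_{k-1}(x). */
  double clenshaw_cheb(const double *a, int n, double x) {
    double bk1 = 0.0, bk2 = 0.0;       /* the two accumulators */
    for (int k = n; k >= 1; k--) {
      double bk = 2.0 * x * bk1 - bk2 + a[k];
      bk2 = bk1;
      bk1 = bk;
    }
    return x * bk1 - bk2 + a[0];       /* final step uses x, not 2x */
  }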




For the multicore-CPUs-with-SIMD world, Horner's isn't the best choice. Horner's minimizes floating-point ops at the expense of parallelism. If you think of the instructions as a parse tree, you want one with a higher branching factor at the top, since independent branches can be arranged to execute in parallel.


If you’re evaluating the same polynomial for a million different points (or heck, 20 points), then there should be plenty of parallelism up for grabs regardless of the algorithm used for each evaluation, so minimizing float ops is probably still reasonable.

Such a workload is much more common than needing to evaluate a polynomial of degree one million at a single point.

Note that you can use SIMD, multi-core, GPUs, etc. for your multiple point evaluations, and these algorithms can be turned into mostly FMA instructions.
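A hedged sketch of that case (one polynomial, many points; the names are mine):

  /* Evaluate one degree-n polynomial at npts points with Horner.
     Each point's accumulator chain is serial, but iterations of the
     i loop are independent, so a compiler can vectorize across points
     and each update becomes one FMA: acc = fma(acc, x[i], a[k]). */
  void horner_many(const double *a, int n,
                   const double *x, double *y, int npts) {
    for (int i = 0; i < npts; i++) {
      double acc = a[n];
      for (int k = n - 1; k >= 0; k--)
        acc = acc * x[i] + a[k];
      y[i] = acc;
    }
  }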

EDIT, moving this comment inline: To see several suggested improvements to Horner’s rule, take a look at Knuth’s TAOCP volume 2, §4.6.4, which is (I assume) a decent summary of the state of the art in the late 90s, though there’s surely more literature since.

Anyway though, we’re mostly talking about evaluating Chebyshev polynomials here. If you know some papers about speeding up Clenshaw’s algorithm, I’d love to look them up. (I’m currently trying to figure out how fast I can make such code run on a GPU/CPU, for my own use projecting high-resolution raster satellite images onto different map projections, so speedups would be welcome.)


Correct, but in many real cases you can do much better than Horner's, especially when you want to exploit SIMD. In fact, a dumb sum of monomial powers can run faster than Horner's: typically your compiler will vectorize it with SIMD pretty effortlessly. It took me by surprise when I first saw it. If you have millions of points you can probably do as you said and still gain a lot of speedup, but for a few hundred it might be a different story.
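Something like this (my sketch of the "dumb" version):

  /* Naive power-basis sum: a[0] + a[1]*x + ... + a[n]*x^n.
     The running power xk and the running sum are two separate
     dependency chains, and the sum is a plain reduction that a
     compiler may reassociate and vectorize (e.g. under -O3
     -ffast-math), whereas Horner's single serial chain can't be. */
  double naive_powers(const double *a, int n, double x) {
    double sum = a[0], xk = 1.0;
    for (int k = 1; k <= n; k++) {
      xk *= x;            /* next power of x */
      sum += a[k] * xk;   /* reassociable reduction */
    }
    return sum;
  }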

EDIT @jacobolus: does it cover Estrin's? It's certainly older than the 90s. The main problem is that Horner's does not have any parallelism. With Estrin's one can parallelize along two axes: use threads for one and SIMD for the other. Although I believe one can play with the parse tree to get better speed than Estrin's on your specific hardware.


> does it cover Estrin's?

Knuth, page 488:

Another attractive method for parallel computation has been suggested by G. Estrin [Proc. Western Joint Computing Conf. 17 (1960), 33–44]; for n = 7, Estrin’s method is:

  Processor 1:       Processor 2:     Processor 3:       Processor 4:     Processor 5:
  a1 = u7*x   + u6   b1 = u5*x + u4   c1 = u3*x   + u2   d1 = u1*x + u0   x^2
  a2 = a1*x^2 + b1                    c2 = c1*x^2 + d1                    x^4
  a3 = a2*x^4 + c2
Here a3 = u(x). However, an interesting analysis by W. S. Dorn [IBM J. Res. and Devel. 6 (1962) 239–245] shows that these methods might not actually be an improvement over the second-order rule, if each arithmetic unit must access a memory that communicates with only one processor at a time.
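
For anyone who wants to play with it, here is a quick single-core C transcription of the n = 7 scheme quoted above (my code, not Knuth's; the a/b/c/d names match his):

  /* u(x) = u[7]*x^7 + ... + u[1]*x + u[0], Estrin-style.
     The four u*x + u pairs and the powers x^2, x^4 are mutually
     independent, so an out-of-order core (or SIMD across points)
     can overlap them instead of waiting on one serial chain. */
  double estrin7(const double *u, double x) {
    double x2 = x * x, x4 = x2 * x2;
    double a1 = u[7] * x + u[6];
    double b1 = u[5] * x + u[4];
    double c1 = u[3] * x + u[2];
    double d1 = u[1] * x + u[0];
    double a2 = a1 * x2 + b1;
    double c2 = c1 * x2 + d1;
    return a2 * x4 + c2;    /* a3 = u(x) */
  }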

* * *

Personally, this seems like a lot of implementation trouble and communication overhead, and I suspect it will lose out badly to a GPU cranking out evaluations at separate points.


For my own thing I used neither Estrin's nor Horner's. I profiled a bunch of equivalent parse trees and then zeroed in on one that performed well on the target machine. You may not need communication, though: as long as the expression decomposes into mutually independent subtrees, the compiler can exploit that to generate better code.


I agree with Jacobolus --- even with, say, 40 points and a 5th-order Clenshaw/Horner rule, you have a 40x5 matrix; parallelize along the long dimension.


Yes, Horner’s rule is a special case of Clenshaw’s algorithm (which works for any three-term recurrence). https://en.wikipedia.org/wiki/Clenshaw_algorithm
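Concretely (my paraphrase of the formulation on that page): for a basis obeying phi_{k+1}(x) = alpha_k(x)*phi_k(x) + beta_k(x)*phi_{k-1}(x), Clenshaw iterates

  b_k = a_k + alpha_k(x)*b_{k+1} + beta_{k+1}(x)*b_{k+2}

and for the monomial basis phi_k(x) = x^k you have alpha_k(x) = x and beta_k(x) = 0, so the second accumulator drops out and the update collapses to b_k = a_k + x*b_{k+1}, which is exactly Horner's.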



