Chebyshev Approximation (embeddedrelated.com)
204 points by alexcasalboni on Aug 25, 2015 | 60 comments



I was actually looking into Chebyshev polynomials and minimax optimization a few days ago, to make a fast vectorized implementation of the logarithm function. In the process of looking around the internet for information, I ran into this: https://github.com/pkhuong/polynomial-approximation-catalogu... (also see the link to his blog at the bottom)

He has nice sets of coefficients derived by taking into account the discrete values of floats/doubles, as well as the performance gained by pinning some coefficients to 0, 1, or 2.

The only thing he did not account for is precision loss from the multiplies and adds when actually evaluating the polynomial, which unfortunately results in a large relative error around log(1). Eigen's vectorized implementation does a weird trick of changing the range to [sqrt(0.5), sqrt(2)] and does not exhibit this large relative error. The unvectorized MSVC implementation does not exhibit it either.
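
The trick, as I understand it (a numpy sketch of my reading, not Eigen's actual code): write x = m * 2^e with m in [sqrt(0.5), sqrt(2)), so the polynomial argument m - 1 stays small and symmetric around 0 near x = 1:

  import numpy as np

  def log_range_reduce(x):
      m, e = np.frexp(x)             # x = m * 2**e, with m in [0.5, 1)
      small = m < np.sqrt(0.5)
      m = np.where(small, 2 * m, m)  # now m in [sqrt(0.5), sqrt(2))
      e = np.where(small, e - 1, e)
      # log(x) = e*log(2) + log(m), with log(m) then approximated
      # by a polynomial in (m - 1).
      return m, e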

Does anyone know how Microsoft and Intel and the like derive the coefficients for their approximations of the elementary functions in math.h?


The classic reference collection of polynomial approximations is Approximations for Digital Computers by Cecil Hastings Jr. The method for deriving the polynomials is described there, as well as coefficients for many common functions. It isn't as complicated as it seems. Intel has probably automated the procedure.


If you want a very simple introduction to Chebyshev approximation, see http://sidewindiegames.com/?p=436 (I skip many useful details). If you are really interested, I highly recommend this article by Sam Hocevar: http://lolengine.net/blog/2011/12/21/better-function-approxi...


Regarding the first blog post link, the barycentric interpolation formula is a pretty awesome way to efficiently evaluate polynomial interpolants: https://people.maths.ox.ac.uk/trefethen/publication/PDF/2004...

The standard Lagrange interpolation formula implemented in that blog post works too, but it is O(n^2) vs. O(n) per evaluation, and it is not numerically stable.
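
For reference, the second ("true") barycentric form is only a few lines; a sketch, assuming nodes xj, values fj, and precomputed weights wj (for Chebyshev points of the second kind, wj = (-1)^j, halved at the two endpoints):

  import numpy as np

  def barycentric(x, xj, fj, wj):
      # Second barycentric formula: O(n) per evaluation point
      # once the weights are known.
      d = x - xj
      if np.any(d == 0):                 # x coincides with a node
          return fj[np.argmin(np.abs(d))]
      q = wj / d
      return np.sum(q * fj) / np.sum(q)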

Also see http://www.maths.manchester.ac.uk/~higham/narep/narep440.pdf https://people.maths.ox.ac.uk/trefethen/publication/PDF/2011...

Regarding the second blog post link, it’s generally a bad idea to evaluate polynomials on an interval in terms of the monomial basis.

But anyhow, on the subject of the Remez algorithm, see http://www.chebfun.org/publications/remez.pdf http://www.chebfun.org/examples/approx/BestApprox.html

Note that while a standard Chebyshev polynomial approximant isn’t quite “optimal” for any particular function in an L∞ norm minimax sense, it’s much easier to find, and often better for most values of x. See Trefethen’s book for details, and also the Chebfun guide, http://www.chebfun.org/docs/guide/guide04.html or this paper, which has a nice example picture under “myth 3”: https://people.maths.ox.ac.uk/trefethen/mythspaper.pdf
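
E.g., numpy can build the Chebyshev interpolant directly, and its max error is typically within a small factor of the true minimax error (a quick sketch):

  import numpy as np
  from numpy.polynomial.chebyshev import Chebyshev

  f = np.exp
  p = Chebyshev.interpolate(f, 10, domain=[-1, 1])
  x = np.linspace(-1, 1, 1001)
  print(np.max(np.abs(f(x) - p(x))))  # near the degree-10 minimax error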


The paper on Barycentric Lagrange Interpolation does a pretty good job of explaining clearly how to evaluate the polynomial efficiently.


It's always interesting to review numerical mathematics at this level. If this interests you, you may want to read C. Lanczos's book Applied Analysis and Hank Warren's book Hacker's Delight, as well as the classic MIT HAKMEM report available on the web. On machines with highly asymmetric architectures it's sometimes worthwhile to use a superoptimizer approach (https://en.wikipedia.org/wiki/Superoptimization). See the classic paper by Alexia Massalin and the GNU implementation.


I love how the context is embedded systems, yet Python is used to demonstrate the technique. Are people really that bad at reading C or C++?


The author (me) frankly hates C and C++. I haven't touched them for PC use since 2009. However, I use them every day for my job working on dsPIC motor control, because I have no better options.

When I am trying to teach someone about algorithms, I turn to Python because it is more readable, quicker for me to write, and platform-independent, and because I can incorporate it into an IPython (now Jupyter) notebook and postprocess that into my blog articles with a minimum of hiccups.


Have you checked out Nim? I've been using it in production for a couple of small programs and I've had no issues. I've also been able to bind it to higher-level languages using the FFI:

http://nim.community/answers/create_a_shared_library/

Although I haven't figured out the garbage collection story for shared objects yet. It is a lovely language, and it's reaching 1.0 sometime this year.


I've heard of it. Rust is probably the most promising to me.

Unfortunately, just because a language exists doesn't mean it is usable on a particular embedded system architecture. The processor core and system libraries need support.


Rust works fine with Cortex-M3/M4 (a bit larger than dsPIC, and their documentation isn't as nice as Microchip's). Library support isn't great though - try zinc.rs or roll your own. An STM32F4 Discovery board is $15. Just in case you didn't have enough to play with...


Shameless self-plug coming here, but it's pretty directly relevant: I've recently started tinkering with rust on a handful of STM32 "discovery" boards. I'm currently focusing on building a low-level environment to use as a foundation for higher-level HAL projects like zinc. The code is still pretty rough around the edges and evolving fast, but I'm open to collaboration and willing to help deal with breakage due to the current pace of change, as time permits, if anyone just wants to build something on it.

https://github.com/mokus0/stm32.rs


I was/am doing something similar (haven't touched it in a bit though).

I avoided putting anything in linker scripts that didn't have to be there (IIRC, just memory blocks and VTOR, for an assembly startup routine that can handle flash or RAM). Peripheral addresses stayed in Rust so that the optimizer could see them (e.g., when writing to RA and then RB, the high u16 stays the same).

BTW F103C8 "dev boards" are available from China for $5.


zinc's "ioregs" macro system takes care of that sort of optimization quite nicely these days, actually. For each peripheral register, one of the things it generates is an "update" structure that the optimizer can eliminate, which allows you to chain updates to multiple fields in a single write. For example, in my stm32f4-discovery "blinky" example it does the bsrr writes to flip 2 LEDs in a single write (I've inspected the generated asm to verify this).

EDIT: nevermind, I realize I misunderstood; you're referring to the optimizer being able to see that both reg addresses have the same high 16 bits. I thought you were referring to the actual interactions with the register contents. Yea, that makes sense.

EDIT 2: for what it's worth, I did a quick test just now and found that moving the addresses into the source actually made the generated code significantly worse. I'm just tinkering around on a lunch break and I need to get back to real work, so I can't investigate too deeply, but the two biggest factors I see are that (1) it simply doesn't perform the optimization you describe, and (2) it makes things even worse because, rather than addressing relative to the peripheral base address, it loads a complete new constant address for every operation.


I just remembered: hardcoded addresses are actually even more important for utilizing "bit banding" for atomic writes to peripheral registers. Not that you couldn't find a way to do this with addresses in the linker script, but it's more straightforward to handle efficiently with compiler optimization.

(And if you aren't atomically writing to control registers when interrupts are enabled, you are most likely building on a foundation of nondeterministic bugs. Rust's linear types kind of inform you of this, but to do anything useful the temptation is to unsafely share across threads.)

Re edit 2: Really? I'd obviously looked at my generated code and seen the expected improvement, so we must be doing something different.


Re edit 2: I'm pretty new at Rust, so it's entirely possible I did something in an unusual way that caused the compiler to emit worse code than it would otherwise. My quick hack to test was to declare the registers as &mut, assigning them by casting a u32 literal to *mut, dereferencing it, and taking &mut of that. It was just the first thing that actually compiled ;). I also tried directly transmuting the u32 to &mut and got the same output.

It's also possible something has changed in rustc since you were playing with it.


It's very possible rustc has changed, but I hope not because the generated code was looking pretty good.

My basic register write is volatile_store(($base as *mut u8).offset(idx as isize) as *mut u32, v); $base is just a hex literal address. There is an extra cast because I defined 'idx' in terms of bytes.

I might not have been concerned about the absolute smallest code when I accessed $base+4 then $base+8 (using a small offset vs. loading a second constant), but I definitely observed the redundant loading of the high half being eliminated between different registers.

I was compiling with "-O -Z no-landing-pads -C lto".


Well, I work for Microchip, so I won't be writing code for other manufacturers' processors, at least not for my day job :-)


Well then go lobby for creating dsPIC support in LLVM! Selling 10M chips at a time is great and all, but designs start with a single engineer. The 16F84 was the hobbyist darling until gcc-avr ate its lunch.


PIC is not an open-source-friendly company. They literally wrapped a license manager around gcc and used it as their C compiler on several platforms, and their libraries come with a license forbidding you from using them with a compiled-from-source version of gcc. If they did switch to LLVM, it'd probably be so they could closed-source their compilers completely.


Sheesh. I worked with their 8-bit chips in an era when using a C compiler was essentially just a nicer syntax for ASM (and you really didn't want to deal with code written by someone who thought differently). In that environment, their easily available good documentation is about as close as you can get to Free. Software complexity was of course on the rise and I figured it was only a matter of time until they "got it" with regards to a Free C compiler etc.

It's sad to hear that they've dug in their heels even further and are now one of the few examples of attacking the GPL in bad faith through contracts. Especially because recently reading their ENC28J60 documentation was a nice break after wading through STM/ARM's stuff.

If I'm ever again in a position where I'm choosing chips for a production design, I'll be verifying what you said for myself (diligence) and then steering toward less risky alternatives.


I'm not sure I would believe the comment from the person you are responding to.


> their libraries come with a license forbidding you from using them with a compiled-from-source version of gcc

This is a quite serious concrete accusation. It would put Microchip in the company of Tivo and Sveasoft - not simply ignoring the GPL, but actively attacking it through bad-faith contractual restrictions.

But yes, I would definitely have to see the specific license for myself before forming any conclusions. HN is essentially just gossip, and GP could have very well interpreted a clause too broadly. But I'm not interested in sorting through what's new in the Microchip world just to preemptively answer this, especially because he could be referring to a single specific library that would be hard to find.


Microchip does have a standard license for maybe 99% of its firmware libraries that includes this proviso, which just forbids you from using the software on microcontrollers made by other manufacturers:

  Microchip licenses to you the right to use, modify, copy and
  distribute Software only when embedded on a Microchip
  microcontroller or digital signal controller that is integrated
  into your product or third party product (pursuant to the
  sublicense terms in the accompanying license agreement).
If there are any other restrictions, I'm not aware of them.


PIC is not the name of the company. It's Microchip Technology, Inc. You seem to be upset about something, since I don't think you have the facts straight about the XC series of compilers; the source code for XC16 and XC32 is posted here under "Source Archive" http://www.microchip.com/pagehandler/en-us/devtools/dev-tool...

Microchip has a number of open-source initiatives. You can use the internal guts of MPLAB X from Java via MDBcore, for example, as a scriptable debugger/simulator.


>Well then go lobby for creating dsPIC support in LLVM!

Um. Been there, done that. Well, you see, [: discussion of internal politics deleted :]


Well, Nim essentially compiles down to C, although there is some weirdness before it hits 1.0 (like big ints everywhere). Try it out and run the tests. If they pass, then you are probably in good hands.


I think it comes down to the same reason algorithm papers use pseudocode rather than a particular language: the point is to present the main idea clearly, rather than bog down the reader in all the accounting and syntax of a particular language.

Details about a particular variable's type, or a language's semantics, don't tell me anything about (in this case) Chebyshev or Taylor functions, but to follow the code I'd have to internalize the interactions of those things, along with the actual math, for it all to make sense.

Consider it an abstraction layer for learning. Just like when dealing with files on disk, I use open(), read(), etc., rather than working directly with my filesystem and HDD controller.


I think the key part you’re missing is “Here's a quick Python function that uses numpy and matplotlib to plot the results:”

Doing the same in C is going to be a lot more effort.


C code would use twice as much room on the page for function and variable declarations and braces.


Python is easier to read. Somebody with almost no programming experience at all can read (simple) Python and understand what's going on.


Folks interested in this topic should look at Chebfun, http://www.chebfun.org

Also at Lloyd Trefethen’s book Approximation Theory and Approximation Practice, of which the first 6 chapters are available online, http://www.chebfun.org/ATAP/

The author of this blog post might as well have used Clenshaw’s algorithm, which, as he mentioned, was suggested by Numerical Recipes. It’s like 4–5 lines of code, fast, and proven numerically stable.

If bk1 and bk2 hold the previous two terms of the recurrence and c is the coefficient array, a function to find y = f(x) for x in [-1, 1] looks like:

  def clenshaw(c, x):
      # Clenshaw recurrence for c[0] + c[1]*T1(x) + c[2]*T2(x) + ...
      x2 = 2 * x
      bk2 = bk1 = 0.0
      for ck in c[:0:-1]:                  # c[n], c[n-1], ..., c[1]
          bk2, bk1 = bk1, ck + x2 * bk1 - bk2
      return c[0] + x * bk1 - bk2
(Note: there might be some typos there; I didn’t try running it. Here’s a vectorized MATLAB version, https://github.com/chebfun/chebfun/blob/development/%40chebt... with coefficients in that version ordered from highest to lowest.)
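
A quick sanity check against numpy’s own evaluator:

  import numpy as np

  c = [1.0, 2.0, 3.0]  # 1 + 2*T1(x) + 3*T2(x)
  assert np.isclose(clenshaw(c, 0.5),
                    np.polynomial.chebyshev.chebval(0.5, c))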

Or if he wanted an easy Python tool, he could have just used this (partial) Python port of Chebfun (though I guess it dates from around the same time as his post): https://github.com/olivierverdier/pychebfun

I’m not convinced about his method of going from measurements on an arbitrary grid to Chebyshev coefficients via weighted least squares. If his measurement points are reasonably clustered near the endpoints of his interval, he could just use an exactly interpolating polynomial and the barycentric interpolation formula to get the values at the Chebyshev points (see Trefethen’s book). His approach might also be fine; I haven’t thought deeply about it. Also see http://www.chebfun.org/examples/approx/RationalInterp.html http://www.chebfun.org/examples/approx/EquispacedData.html http://www.chebfun.org/examples/approx/BestL2Approximation.h...

The neat thing about Chebyshev polynomials is that it’s possible to convert from values at the Chebyshev points (either the extrema or the roots of a Chebyshev polynomial) to Chebyshev coefficients using an FFT, which takes O(n log n) time. Along with various other efficient algorithms for differentiation, integration, root finding, solving differential equations, etc., Chebfun makes it practical to deal with polynomials of very high degree, and do lots of neat stuff with them.
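
For the extrema (the Chebyshev-Lobatto points), a minimal numpy sketch of the FFT route (the function name is mine):

  import numpy as np

  def cheb_coeffs(f, n):
      # Sample f at the n+1 Chebyshev extreme points x_j = cos(pi*j/n),
      x = np.cos(np.pi * np.arange(n + 1) / n)
      v = f(x)
      # even-extend the samples, and take a real FFT (a DCT-I in disguise).
      c = np.fft.rfft(np.concatenate([v, v[-2:0:-1]])).real / n
      c[0] /= 2
      c[n] /= 2
      return c

Then np.polynomial.chebyshev.chebval(x, cheb_coeffs(np.exp, 20)) reproduces np.exp(x) on [-1, 1] to essentially machine precision.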

This blog post’s statement that “higher degree polynomials can have problems with numerical accuracy” turns out to be a myth: https://people.maths.ox.ac.uk/trefethen/mythspaper.pdf


>This blog post’s statement that “higher degree polynomials can have problems with numerical accuracy” turns out to be a myth

(author here) I should have qualified or corrected that statement in my article. There are a few reasons I rarely use n>5:

- when polynomial approximation is appropriate, high-degree polynomials are usually unnecessary on embedded systems (0.1% accuracy is usually sufficient)

- it's usually a sign that polynomial approximation is being misapplied and that there are better ways (via range reduction, a transformation like 1/f(x), etc.)

- with least-squares, computation of the coefficients themselves can have problems with numerical accuracy. When I've used least-squares polynomial fitting with n>10, I often get a complaint from my numerical tool (MATLAB or Python) that the matrix in question is ill-conditioned (see the quick demonstration below). That may be because I used the Vandermonde matrix rather than Chebyshev polynomials, or because I used a large number of points. Chebyshev approximation doesn't have the same problem, I don't think, and I should not have mapped my experience with least-squares fitting onto Chebyshev approximation.
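
The conditioning difference between the monomial (Vandermonde) basis and the Chebyshev basis is easy to see in numpy (a quick demonstration; the particular degree and grid are arbitrary):

  import numpy as np

  x = np.linspace(0, 1, 50)
  # Degree-10 monomial basis on equispaced points: huge condition number.
  print(np.linalg.cond(np.vander(x, 11)))
  # Chebyshev basis on the same points, mapped to [-1, 1]: modest.
  print(np.linalg.cond(np.polynomial.chebyshev.chebvander(2 * x - 1, 10)))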

Evaluation of the polynomials themselves should be fine (after all, we can easily compute a 20th-degree Chebyshev polynomial).


For fitting arbitrary spaced data, you might also try using a rational interpolant, e.g. Floater & Hormann’s http://www.inf.usi.ch/hormann/papers/Floater.2007.BRI.pdf and then afterward approximating that by a Chebyshev polynomial of somewhat higher degree for evaluation or other operations like taking derivatives or finding roots etc. (IIRC Trefethen’s book explains this idea, and the last few links in my comment above are also related).


High-order monomials and uniform grids are a bad mix. If you have the flexibility of choosing the grid, many of the problems of higher-order polynomials can be overcome. But as you said, using very high orders should indeed set off alarm bells in the worst case, and warrant good justification in the best case.


I’m realizing my comment sounds kind of negative. Overall I thought it was a great blog post, full of nice pictures!


>The author of this blog post might as well have used Clenshaw’s algorithm

So is that essentially sort of like Horner's rule but for Chebyshev polynomials? E.g., instead of a0 + (a1 * x) + (a2 * x * x) + (a3 * x * x * x) + ..., use a0 + x * (a1 + x * (a2 + x * (a3 + ...))), so that the coefficients are used in reverse order, and instead of having to maintain a sum and a value of x^n, we just have a single accumulator. In Clenshaw's algorithm we have the two accumulators bk1 and bk2, but they look similar.
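
i.e., something like this quick sketch of plain Horner (not Clenshaw):

  def horner(a, x):
      # a[0] + x*(a[1] + x*(a[2] + ...)): one accumulator,
      # coefficients consumed from the highest degree down.
      acc = 0.0
      for ak in reversed(a):
          acc = ak + x * acc
      return acc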


For the multicore-CPUs-with-SIMD world, Horner's isn't the best choice. Horner's minimizes floating-point ops at the expense of parallelism. If you think of the instructions as a parse tree, you would want one with a higher branching factor at the top; those branches can be arranged to execute in parallel.


If you’re evaluating the same polynomial for a million different points (or heck, 20 points), then there should be plenty of parallelism up for grabs regardless of the algorithm used for each evaluation, so minimizing float ops is probably still reasonable.

Such a workload is much more common than a case where you need to evaluate a polynomial of degree one million at one single point.

Note that you can use SIMD, multi-core, GPUs, etc. for your multiple-point evaluations, and these algorithms can be turned into mostly FMA instructions.
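
With numpy, for instance, the point-parallel version is the same recurrence applied to an array of points, and each loop step becomes one multiply-add across every point (a sketch):

  import numpy as np

  def horner_vec(a, xs):
      # xs is an array of evaluation points; acc is updated for all
      # of them at once, which vectorizes/FMAs nicely.
      acc = np.zeros_like(xs, dtype=float)
      for ak in reversed(a):
          acc = ak + xs * acc
      return acc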

EDIT, moving this comment inline: To see several suggested improvements to Horner’s rule, take a look at Knuth’s TAOCP volume 2, §4.6.4, which is (I assume) a decent summary of the state of the art in the late 90s, though there’s surely more literature since.

Anyway though, we’re mostly talking about evaluating Chebyshev polynomials here. If you know some papers about speeding up Clenshaw’s algorithm, I’d love to look them up. (I’m currently trying to figure out how fast I can make such code run on a GPU/CPU, for my own use projecting high-resolution raster satellite images onto different map projections, so speedups would be welcome.)


Correct, but in many real cases you can do much better than Horner's, especially when you want to exploit SIMD. In fact, a dumb sum of powers of monomials can run faster than Horner's. Typically your compiler will vectorize it to use SIMD pretty effortlessly. It took me by surprise when I saw it. If you have millions of points, you could probably do as you said and still gain a lot of speedup, but for a few hundred it might be a different story.

EDIT @jacobolus: does it cover Estrin's? It's certainly older than the 90s. The main problem is that Horner's does not have any parallelism. With Estrin's, one can parallelize along two axes: use threads for one and SIMD for the other. Although I believe one can play with the parse tree to get better speed than Estrin's on your specific hardware.


> does it cover Estrin's?

Knuth, page 488:

Another attractive method for parallel computation has been suggested by G. Estrin [Proc. Western Joint Computing Conf. 17 (1960), 33–44]; for n = 7, Estrin’s method is:

  Processor 1:       Processor 2:     Processor 3:
  a1 = u7*x   + u6   b1 = u5*x + u4   c1 = u3*x   + u2
  a2 = a1*x^2 + b1                    c2 = c1*x^2 + d1
  a3 = a2*x^4 + c2

  Processor 4:       Processor 5:
  d1 = u1*x + u0     x^2
                     x^4
Here a3 = u(x). However, an interesting analysis by W. S. Dorn [IBM J. Res. and Devel. 6 (1962) 239–245] shows that these methods might not actually be an improvement over the second-order rule, if each arithmetic unit must access a memory that communicates with only one processor at a time.
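
In serial form, the same pairing idea looks like this (a quick sketch of Estrin's scheme; the zero-padding for odd lengths is my own handling):

  def estrin(u, x):
      # u[0] + u[1]*x + ...: combine adjacent coefficient pairs with one
      # multiply-add each, square x, and repeat until one value remains.
      coeffs = list(u)
      while len(coeffs) > 1:
          if len(coeffs) % 2:
              coeffs.append(0.0)
          coeffs = [coeffs[i] + x * coeffs[i + 1]
                    for i in range(0, len(coeffs), 2)]
          x = x * x
      return coeffs[0]

Each pass halves the list, and all the multiply-adds within a pass are independent, which is where the parallelism comes from.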

* * *

Personally this seems like a lot of implementation trouble and communication overhead, and I suspect it will lose out badly to a GPU cranking out evaluations at separate points.


For my own thing I used neither Estrin's nor Horner's. I profiled a bunch of equivalent parse trees and then zeroed in on one that performed well on the target machine. You may not need communication, though. As long as it decomposes into mutually independent subtrees, the compiler can exploit that to generate better code.


I agree with jacobolus --- even with, say, 40 points and a 5th-order Clenshaw/Horner rule, you have a 40x5 matrix; parallelize along the long dimension.


Yes, Horner’s rule is a special case of Clenshaw’s algorithm (which works for any three-term recurrence). https://en.wikipedia.org/wiki/Clenshaw_algorithm


Thanks for the link to Trefethen's myths paper. Very good read.

I have this learned fear of a lot of operations on polynomials, like root-finding, a fear that comes from using the wrong set of basis functions (i.e., the monomial powers x^n), and the linked paper helped me put this superstition in context.


Thanks for an excellent comment -- IMO more useful than the original article. I skimmed a short Trefethen course at the behest of a physicist friend once, and was really impressed by it. I want to study that stuff in more depth someday.


It's a damned good course and worth a lot of time.


Just to add to the reference pile: in addition to Trefethen (who is also the author of some fantastic texts on many other subjects!), Isaacson & Keller's Analysis of Numerical Methods contains a great introduction to this subject.


Incidentally, Chebyshev is pronounced "cheh-bee-shoff", not (as is commonly heard) "cheh-bee-shev".


Чебышёв!


Shameless plug: here is my GPL'd MATLAB implementation of Chebyshev approximation [1]

[1] https://github.com/daniel-levin/numerical-methods-3/blob/mas...


If this interests you, may I suggest Pade approximation as another good tool to have in your tool chest.


You can use both at the same time for extremely nice results. Compute a truncated Chebyshev polynomial approximation (i.e., the first few terms) and then apply Pade approximation [1]. The cool part about this is that you can prescribe your maximum allowed error (between exact solution and approximation) ahead of time, and compute the order of the Pade approximant that guarantees your error will be bounded under all circumstances!

[1] http://siber.cankaya.edu.tr/NumericalComputations/Fall2004/c...
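
For ordinary Pade from Taylor coefficients (not the Chebyshev-Pade variant described above, but the same flavor of idea), scipy has a ready-made routine:

  import math
  from scipy.interpolate import pade

  # Taylor coefficients of exp(x) about 0, then a [3/2] Pade approximant.
  an = [1 / math.factorial(k) for k in range(6)]
  p, q = pade(an, 2)                 # numerator and denominator as poly1d
  print(p(0.5) / q(0.5), math.exp(0.5))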


Is there a way to do Pade approximation with continued fractions, so you can just truncate the series as necessary? I tried looking into this a few months ago, but my head just started to hurt.


I used Chebyshev approximation in my thesis, connecting a boundary value problem to a way of solving it using the Fourier transform. Very, very useful when solving a system of equations that can diverge if an iterative solver is used.


Oh God. It's like I thought I knew how to read and suddenly the comic books were replaced with Shakespeare and I realise there is a huge mountain to climb.

Somehow my children must not grow up as mathematically stunted as I am.


Heh. Two immediate comments:

- It's not as bad as that. Chebyshev approximation + other techniques for evaluating mathematical functions are pretty easy to learn.

- You're absolutely right. Welcome to mathematics! The mountain goes infinitely high. There are tons of technical papers I'll never be able to digest simply because they've compressed lots of thoughts into cryptic-looking but standard notation.


Thanks :-) I'll get back to you once I have mastered long division !


Ironically, that's one of the few useless parts of mathematics, and my least favorite part of my elementary school education.

Suggest group theory or linear algebra instead. ;-)


How to overfit a function, and lose all money, friends and influence at once (after a happy while).



