Clover: 4-Bit Quantized Linear Algebra

0xfaded · on Aug 20, 2018

  Once the data exceed L3 cache,
  4-bit is fastest since the dot
  product is memory bound: data
  movement from RAM becomes the bottleneck.
  The speed-up is up to 6x over
  the 32-bit version.

Large linalg operations are memory bound.

I learned this the hard way hand coding an arm32 5x5 gaussian blur. When benchmarking my first version, using two passes of the separatable 5x1 and 1x5 filters, I had a facepalm moment when I realized I was really only measuring two round trips to main memory.

Operating on 16x8 pixel blocks, which fit within the register file, required most pixels to only be fetched once and almost doubled performance.

Nokinside · on Aug 20, 2018

Just imagine the speedup after somebody finally integrates vector dot product operation into DRAM silicon.

Currently mixing CMOS and DRAM prosesses into same silicon is not cost effective for mass production. I think mixing of the processes and heating issues are the main roadblocks.

kjeetgill · on Aug 21, 2018

I've never really worked at such a fine level. I'd love to hear more about how you measured and benchmarked things and were able to draw conclusions from them.

When benchmarking, I usually have a larger workload in mind. I'm usually working on and deciding between two significant algorithmic changes and I'm looking at differences in 100-1000ms and I still find the standard deviations unsettling.

vanderZwan · on Aug 20, 2018

Obligatory link to Halide, a language that is all about making hand-written scheduling easier:

http://halide-lang.org/

0xfaded · on Aug 20, 2018

Halide is cool, but from when I last looked at it its best at optimizing for cache usage. For my example, there would have still been two round trips, but to L1 instead of something farther.

I've compared using GCC intrinsics vs what I would have wanted. It doesn't really get instruction scheduling right. Things like efficiently using pipelines by grouping, say, multiplication instructions and not stalling by ensuring the right number of cycles between them.

I have some specialised block matrix multiplication which basically never stall the processor. We're talking 10x speedups over optimized but general c++ code.

AstralStorm · on Aug 20, 2018

Also true for most GPUs which are ROP limited. Even more so as register files there are big.

natch · on Aug 20, 2018

This is beautifully presented.

Addressing the author(s): I don’t know much about the topic area but you made it interesting, and it is a trove of cool tricks and techniques (sure, bit manipulation is basic but it’s fun to see how you put it all together to solve the problems). And how you analyze the performance. Look forward to coming back to this after I get more up to speed on linear algebra.

mchahn · on Aug 20, 2018

Many years ago (1970s) I used very expensive but very accurate HP spectrum analyzer. It would take in a signal and present the spectrum on a screen. I studied the manual and found that it used a 1-bit(!) A/D. When averaging over many samples the resolution would build up.

kragen · on Aug 20, 2018

ΔΣ converters ("1-bit ADCs") have been commonplace since the 1970s; typically they use some feedback tricks to get better signal-to-noise ratios than you would calculate straightforwardly. 1-bit DACs were common in CD players throughout the 80s, and the SACD format about 20 years back was a raw delta-sigma bitstream. Unfortunately the Wikipedia article is not very good: https://en.wikipedia.org/wiki/Delta-sigma_modulation

Fundamentally, the reason they're especially accurate is that most of the usual sources of error in digital–analog conversion just don't exist in a 1-bit ADC. INL? Zero. DNL? Zero. Nonmonotonicity? Please.

Clover, by contrast, is interesting to me for a different reason: a lot of ANN inference stuff seems to work fine at 8 bits of precision. It'll be interesting to see if those results can extend to 4 bits.

mchahn · on Aug 20, 2018

Thanks for the info on ΔΣ converters. It connects several things I knew about.

Now I need to study up on the area of computation that clover lives in.

dragontamer · on Aug 20, 2018

Just to note how common delta-sigma converters are: Arduino's got them for voltage sensing.

I dare say that the most common AD converters are simply delta-sigma converters, especially if 1MHz conversions is fast enough for your tasks.

To get into the 100MHz or above speed (ie: Oscilloscopes, Software defined Radios, etc. etc.) you need to use other techniques.

kragen · on Aug 20, 2018

When you say "Arduino's got them for voltage sensing," are you talking about the ADC in the ATMega328 that powers most of the Arduino models from the last decade? Because that ADC isn't a ΔΣ, it's a 10-bit successive-approximation ADC.

dragontamer · on Aug 21, 2018

Ah yes, it seems like I've been mistaken :-(

I distinctly remember that Delta-Sigma was a slow-type of ADC. And I guess I confused it with the one in the ATMega328p.

kragen · on Aug 21, 2018

You're right, it's an ATMega328P, not an ATMega328 like I said.

The ATMega's ADC is pretty slow, like 15ksps at full precision, so I can see why you'd think that!

mchahn · on Aug 20, 2018

> To get into the 100MHz or above speed

That gets back to my original point. The spectrum analyzer went into the gigahertz range. I don't think it was delta sigma. It was some multiple sample thing that only worked in the frequency domain.

gugagore · on Aug 25, 2018

It might have gone into the GHz range, but I'm guessing with a narrow bandwidth. It's the bandwidth that determines the sampling rate needed.

kragen · on Aug 24, 2018

You can make spectrum analyzers that don't digitize at all, of course.

raverbashing · on Aug 20, 2018

Well, I think most popular DACs in consumer electronics today are 1bit SD, though I remember older CD players having a 16bit DAC

They require higher clock frequencies and aren't that accurate (though, as you mentioned, some other sources of inaccuracy are simply nonexistent)