Clover: 4-Bit Quantized Linear Algebra (astojanov.github.io)
94 points by ArtWomb on Aug 20, 2018 | 18 comments



  Once the data exceed L3 cache,
  4-bit is fastest since the dot
  product is memory bound: data
  movement from RAM becomes the bottleneck.
  The speed-up is up to 6x over
  the 32-bit version.
Large linalg operations are memory bound.
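
For a sense of why the 4-bit version wins, here is a toy scalar sketch of a packed 4-bit dot product (my own illustration, not Clover's actual code or data layout; the nibble packing and per-vector scale are assumptions). Only an eighth of the bytes of an fp32 vector have to cross the memory bus:

  #include <cstdint>
  #include <cstddef>

  // Two signed 4-bit values packed per byte, plus a per-vector float scale,
  // so each element represents roughly scale * q.
  float dot4(const uint8_t* a, float scale_a,
             const uint8_t* b, float scale_b,
             size_t n)  // n = number of elements, assumed even
  {
      int32_t acc = 0;
      for (size_t i = 0; i < n / 2; ++i) {
          // Unpack both nibbles and sign-extend from 4 to 8 bits.
          int8_t a_lo = (int8_t)(uint8_t)(a[i] << 4) >> 4;
          int8_t a_hi = (int8_t)a[i] >> 4;
          int8_t b_lo = (int8_t)(uint8_t)(b[i] << 4) >> 4;
          int8_t b_hi = (int8_t)b[i] >> 4;
          acc += a_lo * b_lo + a_hi * b_hi;
      }
      return scale_a * scale_b * (float)acc;
  }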

I learned this the hard way hand-coding an ARM32 5x5 Gaussian blur. When benchmarking my first version, using two passes of the separable 5x1 and 1x5 filters, I had a facepalm moment when I realized I was really only measuring two round trips to main memory.

Operating on 16x8 pixel blocks, which fit within the register file, meant most pixels only had to be fetched once, and it almost doubled performance.
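
Roughly what the fused version looks like in scalar code (a sketch of the idea, not the actual NEON version; the tile size and kernel weights are placeholders, and image borders are ignored for clarity):

  #include <cstdint>

  void blur5x5_tile(const uint8_t* src, uint8_t* dst, int stride,
                    int tile_x, int tile_y) {
      static const int k[5] = {1, 4, 6, 4, 1};  // 1D binomial kernel, sum = 16
      const int W = 16, H = 8;                  // tile sized to stay register-resident

      // Horizontal 5x1 pass into a small on-stack buffer: the intermediate
      // stays hot instead of making a round trip to main memory.
      uint16_t tmp[H + 4][W];                   // 4 extra rows for the vertical taps
      for (int y = -2; y < H + 2; ++y)
          for (int x = 0; x < W; ++x) {
              int s = 0;
              for (int t = -2; t <= 2; ++t)
                  s += k[t + 2] * src[(tile_y + y) * stride + tile_x + x + t];
              tmp[y + 2][x] = (uint16_t)(s >> 4);
          }

      // Vertical 1x5 pass straight from the hot buffer to the destination.
      for (int y = 0; y < H; ++y)
          for (int x = 0; x < W; ++x) {
              int s = 0;
              for (int t = -2; t <= 2; ++t)
                  s += k[t + 2] * tmp[y + t + 2][x];
              dst[(tile_y + y) * stride + tile_x + x] = (uint8_t)(s >> 4);
          }
  }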


Just imagine the speedup after somebody finally integrates a vector dot-product operation into DRAM silicon.

Currently, mixing CMOS logic and DRAM processes on the same silicon is not cost-effective for mass production. I think the process mixing and heat issues are the main roadblocks.


I've never really worked at such a fine level. I'd love to hear more about how you measured and benchmarked things and were able to draw conclusions from them.

When benchmarking, I usually have a larger workload in mind. I'm usually deciding between two significant algorithmic changes, looking at differences of 100-1000 ms, and I still find the standard deviations unsettling.
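
For reference, my harness is roughly this shape (a sketch with made-up names; the rep and sample counts are arbitrary): repeat the workload enough times per sample to swamp the timer's resolution, take many samples, and look at the minimum and median rather than the mean.

  #include <algorithm>
  #include <chrono>
  #include <cstdio>
  #include <vector>

  template <class F>
  void bench(const char* name, F&& workload, int reps = 10, int samples = 30) {
      using clock = std::chrono::steady_clock;
      std::vector<double> times;
      for (int s = 0; s < samples; ++s) {
          auto t0 = clock::now();
          for (int r = 0; r < reps; ++r) workload();
          auto t1 = clock::now();
          // Average time per run within this sample, in milliseconds.
          times.push_back(std::chrono::duration<double, std::milli>(t1 - t0).count() / reps);
      }
      std::sort(times.begin(), times.end());
      std::printf("%s: min %.3f ms, median %.3f ms per run\n",
                  name, times.front(), times[times.size() / 2]);
  }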


Obligatory link to Halide, a language that is all about making hand-written scheduling easier:

http://halide-lang.org/
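
The well-known separable blur from the Halide front page gives the flavour (reproduced roughly from memory, so details may be off): the algorithm and the schedule are written separately, and the schedule decides the tiling, vectorization, and where the intermediate gets computed.

  #include "Halide.h"
  using namespace Halide;

  Func blur_3x3(Func input) {
      Func blur_x, blur_y;
      Var x, y, xi, yi;

      // The algorithm: no storage or evaluation order implied.
      blur_x(x, y) = (input(x - 1, y) + input(x, y) + input(x + 1, y)) / 3;
      blur_y(x, y) = (blur_x(x, y - 1) + blur_x(x, y) + blur_x(x, y + 1)) / 3;

      // The schedule: tile the output and compute blur_x per tile,
      // so the intermediate never leaves cache.
      blur_y.tile(x, y, xi, yi, 256, 32).vectorize(xi, 8).parallel(y);
      blur_x.compute_at(blur_y, x).vectorize(x, 8);

      return blur_y;
  }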


Halide is cool, but from when I last looked at it, it's best at optimizing for cache usage. For my example there would still have been two round trips, just to L1 instead of somewhere farther out.

I've compared what GCC generates from intrinsics against what I would have written by hand. It doesn't really get instruction scheduling right: things like using the pipelines efficiently by grouping, say, multiplication instructions and avoiding stalls by ensuring the right number of cycles between them.

I have some specialised block matrix multiplication routines which basically never stall the processor. We're talking 10x speedups over optimized but general C++ code.
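
To give a flavour of what I mean (a scalar stand-in, not my actual intrinsics code; the 4x4 block size and strides are placeholders): all 16 accumulators live in registers, and every k step issues 16 independent multiply-adds, so there is no dependency chain for the pipeline to stall on.

  // C[4x4] += A[4xK] * B[Kx4], with row strides lda, ldb, ldc.
  void matmul_4x4(const float* A, const float* B, float* C, int K,
                  int lda, int ldb, int ldc) {
      float c[4][4] = {};
      for (int k = 0; k < K; ++k) {
          float a0 = A[0 * lda + k], a1 = A[1 * lda + k],
                a2 = A[2 * lda + k], a3 = A[3 * lda + k];
          float b0 = B[k * ldb + 0], b1 = B[k * ldb + 1],
                b2 = B[k * ldb + 2], b3 = B[k * ldb + 3];
          c[0][0] += a0 * b0; c[0][1] += a0 * b1; c[0][2] += a0 * b2; c[0][3] += a0 * b3;
          c[1][0] += a1 * b0; c[1][1] += a1 * b1; c[1][2] += a1 * b2; c[1][3] += a1 * b3;
          c[2][0] += a2 * b0; c[2][1] += a2 * b1; c[2][2] += a2 * b2; c[2][3] += a2 * b3;
          c[3][0] += a3 * b0; c[3][1] += a3 * b1; c[3][2] += a3 * b2; c[3][3] += a3 * b3;
      }
      // Write the register-resident block back to memory once.
      for (int i = 0; i < 4; ++i)
          for (int j = 0; j < 4; ++j)
              C[i * ldc + j] += c[i][j];
  }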


Also true for most GPUs, which are ROP-limited. Even more so since the register files there are big.


This is beautifully presented.

Addressing the author(s): I don't know much about the topic area, but you made it interesting, and it is a trove of cool tricks and techniques (sure, bit manipulation is basic, but it's fun to see how you put it all together to solve the problems), and I enjoyed how you analyze the performance. I look forward to coming back to this after I get more up to speed on linear algebra.


Many years ago (1970s) I used a very expensive but very accurate HP spectrum analyzer. It would take in a signal and present the spectrum on a screen. I studied the manual and found that it used a 1-bit(!) A/D. When averaging over many samples, the resolution would build up.


ΔΣ converters ("1-bit ADCs") have been commonplace since the 1970s; typically they use some feedback tricks to get better signal-to-noise ratios than you would calculate straightforwardly. 1-bit DACs were common in CD players throughout the 80s, and the SACD format about 20 years back was a raw delta-sigma bitstream. Unfortunately the Wikipedia article is not very good: https://en.wikipedia.org/wiki/Delta-sigma_modulation

Fundamentally, the reason they're especially accurate is that most of the usual sources of error in digital–analog conversion just don't exist in a 1-bit ADC. INL? Zero. DNL? Zero. Nonmonotonicity? Please.
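
A first-order modulator is only a few lines, which shows why averaging builds up resolution (a generic sketch with a DC input, not any particular converter's design):

  #include <cstdio>

  int main() {
      const double x = 0.37;      // DC input, normalized to [-1, 1]
      const int N = 1 << 20;      // oversampling: number of 1-bit samples
      double integ = 0.0, fb = 0.0;
      long long sum = 0;
      for (int i = 0; i < N; ++i) {
          integ += x - fb;                      // accumulate the quantization error
          int bit = (integ >= 0.0) ? 1 : -1;    // 1-bit quantizer
          fb = bit;                             // feed the decision back
          sum += bit;
      }
      // The mean of the bitstream converges on the input as N grows.
      std::printf("input %.6f, bitstream average %.6f\n", x, (double)sum / N);
  }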

Clover, by contrast, is interesting to me for a different reason: a lot of ANN inference stuff seems to work fine at 8 bits of precision. It'll be interesting to see if those results can extend to 4 bits.


Thanks for the info on ΔΣ converters. It connects several things I knew about.

Now I need to study up on the area of computation that clover lives in.


Just to note how common delta-sigma converters are: Arduino's got them for voltage sensing.

I dare say that the most common A/D converters are simply delta-sigma converters, especially if 1 MHz conversion is fast enough for your task.

To get into the 100MHz or above speed (i.e. oscilloscopes, software-defined radios, etc.) you need to use other techniques.


When you say "Arduino's got them for voltage sensing," are you talking about the ADC in the ATMega328 that powers most of the Arduino models from the last decade? Because that ADC isn't a ΔΣ, it's a 10-bit successive-approximation ADC.
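
For contrast with delta-sigma, successive approximation is just a binary search against an internal DAC, one bit per clock (a software model for illustration, not the ATMega's actual hardware):

  #include <cstdint>

  uint16_t sar_convert(double vin, double vref, int bits = 10) {
      uint16_t code = 0;
      for (int b = bits - 1; b >= 0; --b) {
          code |= (uint16_t)(1u << b);                       // trial-set this bit
          double vdac = vref * code / (double)(1u << bits);  // DAC output for the trial code
          if (vin < vdac)
              code &= (uint16_t)~(1u << b);                  // comparator says too high: clear it
      }
      return code;  // one comparison per bit of resolution
  }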


Ah yes, it seems like I've been mistaken :-(

I distinctly remember that delta-sigma was a slow type of ADC, and I guess I confused it with the one in the ATMega328P.


You're right, it's an ATMega328P, not an ATMega328 like I said.

The ATMega's ADC is pretty slow, like 15ksps at full precision, so I can see why you'd think that!


> To get into the 100MHz or above speed

That gets back to my original point. The spectrum analyzer went into the gigahertz range. I don't think it was delta sigma. It was some multiple sample thing that only worked in the frequency domain.


It might have gone into the GHz range, but I'm guessing with a narrow bandwidth. It's the bandwidth that determines the sampling rate needed.


You can make spectrum analyzers that don't digitize at all, of course.


Well, I think the most popular DACs in consumer electronics today are 1-bit ΔΣ, though I remember older CD players having a 16-bit DAC.

They require higher clock frequencies and aren't that accurate (though, as you mentioned, some other sources of inaccuracy are simply nonexistent).



