Once the data exceed the L3 cache, 4-bit is fastest since the dot product is memory bound: data movement from RAM becomes the bottleneck. The speed-up is up to 6x over the 32-bit version.
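A minimal scalar sketch of where the saving comes from (illustrative only; the function name, packing, and per-vector scaling here are made up and are not Clover's actual kernel or bit layout): two float32 vectors of length n cost 8n bytes of reads, while two 4-bit packed vectors cost n bytes.

    #include <stdint.h>
    #include <stddef.h>

    /* Sign-extend a 4-bit two's-complement nibble to int. */
    static int s4(unsigned v) { return (int)(v & 0x0F) - ((v & 0x08) ? 16 : 0); }

    /* Hypothetical 4-bit dot product: values are packed two per byte,
     * accumulated in integer, then rescaled by per-vector scales.
     * For n elements this streams n bytes instead of 8n for float32. */
    float dot4(const uint8_t *a, const uint8_t *b, size_t n,
               float scale_a, float scale_b)
    {
        int32_t acc = 0;
        for (size_t i = 0; i < n / 2; i++) {
            acc += s4(a[i])      * s4(b[i])         /* low nibbles  */
                 + s4(a[i] >> 4) * s4(b[i] >> 4);   /* high nibbles */
        }
        return (float)acc * scale_a * scale_b;
    }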
Large linalg operations are memory bound.
I learned this the hard way hand-coding an arm32 5x5 Gaussian blur. When benchmarking my first version, which used two passes of the separable 5x1 and 1x5 filters, I had a facepalm moment when I realized I was really only measuring two round trips to main memory.
Operating on 16x8 pixel blocks, which fit within the register file, meant most pixels only had to be fetched once, and that almost doubled performance.
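In plain C the first version looked roughly like this (illustrative sketch only, not the actual arm32 code; the kernel weights and border handling here are made up):

    #include <stdint.h>
    #include <stddef.h>

    /* Two-pass separable 5x5 blur. Each pass streams the whole image
     * through main memory once, so for images larger than cache the
     * benchmark mostly measures those two round trips. */
    void blur5x5_two_pass(const uint8_t *src, uint8_t *tmp, uint8_t *dst,
                          size_t w, size_t h)
    {
        static const int k[5] = {1, 4, 6, 4, 1};   /* 5-tap binomial, sums to 16 */

        /* Pass 1: horizontal 5x1 filter over the whole image. */
        for (size_t y = 0; y < h; y++)
            for (size_t x = 2; x + 2 < w; x++) {
                int s = 0;
                for (int i = -2; i <= 2; i++)
                    s += k[i + 2] * src[y * w + x + i];
                tmp[y * w + x] = (uint8_t)(s >> 4);
            }

        /* Pass 2: vertical 1x5 filter over the whole image again. */
        for (size_t y = 2; y + 2 < h; y++)
            for (size_t x = 2; x + 2 < w; x++) {
                int s = 0;
                for (int i = -2; i <= 2; i++)
                    s += k[i + 2] * tmp[(y + i) * w + x];
                dst[y * w + x] = (uint8_t)(s >> 4);
            }
    }

The blocked version instead walks 16x8 tiles, does both the horizontal and vertical passes while the tile (plus its 2-pixel apron) is still in registers, and writes each output pixel once.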
Just imagine the speedup once somebody finally integrates a vector dot-product operation into DRAM silicon.
Currently, mixing CMOS logic and DRAM processes on the same silicon is not cost-effective for mass production. I think the process mixing and heat dissipation are the main roadblocks.
I've never really worked at such a fine level. I'd love to hear more about how you measured and benchmarked things and were able to draw conclusions from them.
When benchmarking, I usually have a larger workload in mind. Typically I'm deciding between two significant algorithmic changes, looking at differences of 100-1000 ms, and I still find the standard deviations unsettling.
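Concretely, the harness is nothing fancier than repeating the workload and looking at the spread, something like this (an example sketch, not my actual code; the clock source and the stats reported are just one reasonable choice):

    #include <math.h>
    #include <stdio.h>
    #include <time.h>

    /* Wall-clock time in milliseconds from a monotonic clock. */
    static double now_ms(void)
    {
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);
        return ts.tv_sec * 1e3 + ts.tv_nsec * 1e-6;
    }

    /* Run the workload several times and report mean +/- standard
     * deviation, so a 100-1000 ms difference between two algorithms
     * can be compared against the run-to-run noise. */
    void bench(const char *name, void (*workload)(void), int runs)
    {
        double sum = 0.0, sum_sq = 0.0;
        for (int i = 0; i < runs; i++) {
            double t0 = now_ms();
            workload();
            double dt = now_ms() - t0;
            sum    += dt;
            sum_sq += dt * dt;
        }
        double mean = sum / runs;
        double var  = sum_sq / runs - mean * mean;
        double sd   = sqrt(var > 0.0 ? var : 0.0);
        printf("%s: %.1f ms +/- %.1f ms over %d runs\n", name, mean, sd, runs);
    }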
Halide is cool, but from when I last looked at it, it's best at optimizing for cache usage. For my example there would still have been two round trips, just to L1 instead of something farther away.
I've compared using GCC intrinsics against what I would have written by hand, and it doesn't really get instruction scheduling right: things like keeping the pipelines busy by grouping, say, multiplication instructions, and avoiding stalls by leaving the right number of cycles between dependent ones.
I have some specialised block matrix multiplication kernels that basically never stall the processor. We're talking 10x speedups over optimized but general C++ code.
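The shape of the kernel is roughly this (a scalar sketch rather than the actual intrinsics; in the real version the 4x4 accumulator block lives in SIMD registers and the independent multiply-adds are interleaved so the pipelines stay full):

    #include <stddef.h>

    /* Register-blocked micro-kernel: C[4x4] += A[4xk] * B[kx4].
     * The 16 accumulators stay in registers for the whole k-loop, and
     * each iteration issues 16 independent multiply-accumulates, which
     * gives the FP pipelines enough unrelated work to avoid stalls. */
    void matmul_4x4_block(const float *A, const float *B, float *C,
                          size_t k, size_t lda, size_t ldb, size_t ldc)
    {
        float c00 = 0, c01 = 0, c02 = 0, c03 = 0;
        float c10 = 0, c11 = 0, c12 = 0, c13 = 0;
        float c20 = 0, c21 = 0, c22 = 0, c23 = 0;
        float c30 = 0, c31 = 0, c32 = 0, c33 = 0;

        for (size_t p = 0; p < k; p++) {
            float a0 = A[0 * lda + p], a1 = A[1 * lda + p];
            float a2 = A[2 * lda + p], a3 = A[3 * lda + p];
            float b0 = B[p * ldb + 0], b1 = B[p * ldb + 1];
            float b2 = B[p * ldb + 2], b3 = B[p * ldb + 3];

            c00 += a0 * b0; c01 += a0 * b1; c02 += a0 * b2; c03 += a0 * b3;
            c10 += a1 * b0; c11 += a1 * b1; c12 += a1 * b2; c13 += a1 * b3;
            c20 += a2 * b0; c21 += a2 * b1; c22 += a2 * b2; c23 += a2 * b3;
            c30 += a3 * b0; c31 += a3 * b1; c32 += a3 * b2; c33 += a3 * b3;
        }

        C[0 * ldc + 0] += c00; C[0 * ldc + 1] += c01; C[0 * ldc + 2] += c02; C[0 * ldc + 3] += c03;
        C[1 * ldc + 0] += c10; C[1 * ldc + 1] += c11; C[1 * ldc + 2] += c12; C[1 * ldc + 3] += c13;
        C[2 * ldc + 0] += c20; C[2 * ldc + 1] += c21; C[2 * ldc + 2] += c22; C[2 * ldc + 3] += c23;
        C[3 * ldc + 0] += c30; C[3 * ldc + 1] += c31; C[3 * ldc + 2] += c32; C[3 * ldc + 3] += c33;
    }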
Addressing the author(s): I don't know much about the topic area, but you made it interesting, and it is a trove of cool tricks and techniques (sure, bit manipulation is basic, but it's fun to see how you put it all together to solve the problems). The same goes for how you analyze the performance. I look forward to coming back to this after I get more up to speed on linear algebra.
Many years ago (in the 1970s) I used a very expensive but very accurate HP spectrum analyzer. It would take in a signal and present its spectrum on a screen. I studied the manual and found that it used a 1-bit(!) A/D converter. When averaging over many samples, the resolution would build up.
ΔΣ converters ("1-bit ADCs") have been commonplace since the 1970s; typically they use some feedback tricks to get better signal-to-noise ratios than you would calculate straightforwardly. 1-bit DACs were common in CD players throughout the 80s, and the SACD format about 20 years back was a raw delta-sigma bitstream. Unfortunately the Wikipedia article is not very good: https://en.wikipedia.org/wiki/Delta-sigma_modulation
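If you want to see why averaging a 1-bit stream buys resolution, a first-order modulator is just an accumulator, a comparator, and feedback. Here's a toy simulation (illustrative only, not how the HP instrument or any particular converter is built):

    #include <stdio.h>

    /* Toy first-order delta-sigma modulator. The feedback pushes the
     * quantization error to high frequencies, so a plain average
     * (low-pass) of N output bits tracks the input to roughly 1/N. */
    int main(void)
    {
        const double input = 0.3217;   /* "analog" value in [0, 1) */
        const int    N     = 100000;   /* number of 1-bit samples   */

        double integrator = 0.0;
        long   ones = 0;

        for (int i = 0; i < N; i++) {
            integrator += input;           /* accumulate the input   */
            if (integrator >= 1.0) {       /* 1-bit quantizer        */
                integrator -= 1.0;         /* feed the decision back */
                ones++;
            }
        }

        printf("input = %.6f, mean of %d bits = %.6f\n",
               input, N, (double)ones / N);
        return 0;
    }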
Fundamentally, the reason they're especially accurate is that most of the usual sources of error in digital–analog conversion just don't exist in a 1-bit ADC. INL? Zero. DNL? Zero. Nonmonotonicity? Please.
Clover, by contrast, is interesting to me for a different reason: a lot of ANN inference stuff seems to work fine at 8 bits of precision. It'll be interesting to see if those results can extend to 4 bits.
When you say "Arduino's got them for voltage sensing," are you talking about the ADC in the ATMega328 that powers most of the Arduino models from the last decade? Because that ADC isn't a ΔΣ, it's a 10-bit successive-approximation ADC.
That gets back to my original point. The spectrum analyzer went into the gigahertz range. I don't think it was delta sigma. It was some multiple sample thing that only worked in the frequency domain.