> Having said that, I’m curious as well if using effort you can get better resul...

kolinko · 2024-04-18T06:56:17 1713423377

Below Q8 I think it will combine poorly: bucketMul needs to keep a few bits for an index. This can be partly offset by the fact that the numbers in each bucket come from a limited range. My intuition is that at Q8 the two effects roughly cancel out, meaning bucketed Q8 may store as much information as regular Q8, but it gets harder at lower quantizations and becomes impossible below Q5. I don't have the math to back it up, but perhaps some information theorist could pick up the calculations.
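A back-of-the-envelope version of that bit budget, as a rough sketch only. The bucket size of 16 (so 4 index bits) is an assumption for illustration, and `value_bits` is a hypothetical helper, not code from the actual implementation. It also ignores the limited-range effect, which would win some precision back.

```python
def value_bits(total_bits: int, bucket_size: int = 16) -> int:
    """Bits left for the weight value after reserving an in-bucket index.

    Assumption: the index needs enough bits to address every slot
    in a bucket, e.g. 4 bits for a bucket of 16.
    """
    index_bits = (bucket_size - 1).bit_length()
    return max(total_bits - index_bits, 0)


# Remaining value bits at a few quantization levels (assumed bucket size 16).
budget = {q: value_bits(q) for q in (8, 6, 5, 4, 2)}

for q, bits in budget.items():
    print(f"Q{q}: {bits} bits left for the value")
```

Under these assumptions the budget hits zero at Q4, which is in line with the intuition that things stop working somewhere below Q5. A smaller bucket would shift that boundary down.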

You could say these are divergent paths for future development. If the results hold for Q8, perhaps you could crop Q8 models to Q6 sizes and get the inference speeds of Q2/Q4. On the other hand, the wildly optimistic scenario is that bucketMul speeds will beat even Q1 through dynamic computation pruning: token by token and layer by layer, with a separate small network choosing effort levels based on the input data. (Someone already proposed this in the thread.)

For now, the most important things are fixing the bugs, doing more serious tests for FP16, and showing how the charts look for Q8. Q8 especially is the biggest unknown, although the initial results are hopeful.