
> Having said that, I’m curious as well if using effort you can get better results on Q8 cropped to 75% than on Q6.

This is what I wanted to ask. This seems like the same kind of optimization as quantization, sacrificing a bit of quality for performance by discarding some data. So then the question is, which is better, and how do they combine?

You could even potentially get different results at different points in the scale. Maybe Q8 cropped to 75% isn't better than Q6 but Q4 cropped to 75% is better than Q3, or vice versa.
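To make the pairings above concrete, here is a rough back-of-the-envelope sketch. It assumes "cropped to 75%" means only 75% of the weights are read per multiplication, so the average cost per original weight scales linearly with effort. The function name `effective_bits` is made up for illustration, and the sketch ignores any index or metadata overhead.

```python
def effective_bits(quant_bits: float, effort: float) -> float:
    """Average bits read per original weight when only a fraction
    `effort` of the weights (each stored in `quant_bits` bits) is used.
    Assumption: cost scales linearly with effort, no index overhead."""
    if not 0.0 <= effort <= 1.0:
        raise ValueError("effort must be between 0 and 1")
    return quant_bits * effort


# The two pairings mentioned in the comment above:
q8_cropped = effective_bits(8, 0.75)  # same budget as Q6
q4_cropped = effective_bits(4, 0.75)  # same budget as Q3
print(q8_cropped, q4_cropped)
```

Under that assumption, Q8 at 75% and Q6 read the same number of bits, as do Q4 at 75% and Q3, so the open question is purely which one loses less quality at equal cost.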




Below Q8 I think it will combine poorly: bucketMul needs to keep a few bits for an index. This can be offset by the fact that the numbers in each bucket come from a limited range. My intuition is that with Q8 this trades off roughly evenly, meaning bucketed Q8 may store as much info as regular Q8, but it will get more difficult with lower quantizations and cross the boundary of impossible after Q5. I don't have the math to back this up, but perhaps some information theorist could pick up the calculations.
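A crude way to see why the index becomes a problem at low bit widths, as a sketch only: the comment says "a few bits" without giving a number, so `index_bits=4` below is a purely hypothetical value, and `value_bits_left` is an illustrative name rather than anything from the bucketMul code. The sketch also ignores the offsetting effect of each bucket covering a limited range.

```python
def value_bits_left(total_bits: int, index_bits: int = 4) -> int:
    """Bits remaining for the value itself once `index_bits` of each
    stored entry are spent on an index.
    Assumption: index_bits=4 is hypothetical, not a measured figure."""
    return max(total_bits - index_bits, 0)


# How much room is left for the value at each quantization level.
for q in (8, 6, 5, 4, 3):
    print(f"Q{q}: {value_bits_left(q)} value bits left")
```

With a fixed index cost, the share of each entry left for the value shrinks quickly as the total width drops, which matches the intuition that the approach gets harder at lower quantizations.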

You could say these are divergent paths for future development (if the results hold for Q8): perhaps you can crop Q8 models to Q6 sizes and get the inference speeds of Q2/Q4. On the other hand, the wildly optimistic scenario is that bucketMul speeds will beat even Q1 with dynamic computation pruning, token by token and layer by layer, by having a separate small network choose effort levels based on the input data (someone already proposed this in the thread).

For now, the most important things are fixing the bugs, doing more serious tests for FP16, and showing how the charts look for Q8. Q8 in particular is the biggest unknown, although the initial results are hopeful.



