Lower latency, but also much lower power. This sort of thing would be of great interest to companies running AI datacenters (which is why Microsoft is doing this research, I'd think). Low latency is also quite useful for real-time tasks.
> The code itself is surprisingly small/tight. I've been playing with llama.cpp for the last few days.
Is there a bitnet model that runs on llama.cpp? (Looks like there is: https://www.reddit.com/r/LocalLLaMA/comments/1dmt4v7/llamacp...) Which bitnet model did you use?