Qwen Coder 32B Instruct is the state of the art for local LLM coding. It will run with a smallish context on a 64GB laptop with partial GPU offload, probably at around 0.8 tok/sec.

With a quantized version you can run larger contexts and go a bit faster: about 1.4 tok/sec with an 8-bit quant, offloading part of the model to a 6GB laptop GPU.
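Rough back-of-envelope for why offload and quantization matter. This is a sketch that counts only weight bytes and assumes a nominal 32e9 parameters. It ignores the KV cache (which grows with context length) and runtime overhead, and the GPU figure is an upper bound because the runtime needs some VRAM for itself.

```python
def weight_memory_gb(n_params: float, bits_per_weight: float) -> float:
    """Weights only: ignores KV cache (grows with context) and runtime overhead."""
    return n_params * bits_per_weight / 8 / 1e9


def gpu_fraction(weights_gb: float, gpu_gb: float) -> float:
    """Upper bound on the share of weights that fits on the GPU; the rest stays in system RAM."""
    return min(1.0, gpu_gb / weights_gb)


N_PARAMS = 32e9  # assumed nominal size for a "32B" model

for label, bits in [("16-bit", 16), ("8-bit", 8), ("4-bit", 4)]:
    gb = weight_memory_gb(N_PARAMS, bits)
    print(f"{label}: ~{gb:.0f} GB of weights, at most {gpu_fraction(gb, 6.0):.0%} on a 6GB GPU")
```

Halving the bits per weight halves the weight footprint, and the freed memory is what leaves room for a larger context and a bigger share of the model on the GPU.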

Speculative decoding has recently been added to many of the runtimes and can give a 20-30% boost, using a ~1B-parameter draft model to generate the speculative token stream.
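For anyone wondering how the small model helps: below is a minimal sketch of greedy speculative decoding. The "models" are hypothetical toy functions that return the next token for a context; a real runtime would use the 32B model as `target` and the ~1B model as `draft`, and would verify all drafted positions in a single batched forward pass of the target rather than the per-position loop used here.

```python
from typing import Callable, List, Tuple

Model = Callable[[List[int]], int]  # hypothetical: greedy next token for a context


def greedy_decode(target: Model, prompt: List[int], n_new: int) -> List[int]:
    """Baseline: one target call per generated token."""
    seq = list(prompt)
    for _ in range(n_new):
        seq.append(target(seq))
    return seq


def speculative_decode(target: Model, draft: Model, prompt: List[int],
                       n_new: int, k: int = 4) -> Tuple[List[int], int]:
    """Greedy speculative decoding; returns (sequence, number of target verification passes)."""
    seq = list(prompt)
    goal = len(prompt) + n_new
    target_passes = 0
    while len(seq) < goal:
        # 1. The cheap draft model proposes up to k tokens autoregressively.
        proposal, ctx = [], list(seq)
        for _ in range(min(k, goal - len(seq))):
            tok = draft(ctx)
            proposal.append(tok)
            ctx.append(tok)
        # 2. One target "pass" checks every proposed position.
        target_passes += 1
        accepted, ctx = [], list(seq)
        for tok in proposal:
            expected = target(ctx)
            if tok != expected:
                accepted.append(expected)  # target's token replaces the first mismatch
                break
            accepted.append(tok)
            ctx.append(tok)
        else:
            accepted.append(target(ctx))  # all accepted: the same pass yields a bonus token
        seq.extend(accepted)
    return seq[:goal], target_passes


def toy_target(ctx: List[int]) -> int:
    return (sum(ctx) * 7 + len(ctx)) % 50


def toy_draft(ctx: List[int]) -> int:
    # Agrees with the target except when the context length is a multiple of 5.
    guess = toy_target(ctx)
    return guess if len(ctx) % 5 else (guess + 1) % 50


prompt = [1, 2, 3]
out, passes = speculative_decode(toy_target, toy_draft, prompt, n_new=20)
print(out == greedy_decode(toy_target, prompt, 20), passes)
```

The output matches plain greedy decoding with the big model; the speedup comes from needing fewer big-model passes per token, and it shrinks as the draft model's guesses are rejected more often.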