Running models locally with GPU inference shouldn't be too bad, since the biggest factor for performance is RAM/VRAM bandwidth rather than compute. Some rough power figures for a dual AMD GPU setup (24 GB VRAM total) on a 5950X (base power draw of around 100 W), using llama.cpp (i.e., a ChatGPT style interface, not Copilot):
46b Mixtral q4 (26.5 GB required) with around 75% in VRAM: 15 tokens/s - 300 W at the wall, nvtop reporting GPU power usage of 70 W/30 W, 0.37 kWh
46b Mixtral q2 (16.1 GB required) with 100% in VRAM: 30 tokens/s - 350 W, nvtop 150 W/50 W, 0.21 kWh
Same q2 test with 0% in VRAM: 7 tokens/s - 250 W, 0.65 kWh
7b Mistral q8 (7.2 GB required) with 100% in VRAM: 45 tokens/s - 300 W, nvtop 170 W, 0.12 kWh
The kWh figures are an estimate for generating 64k tokens (around 35 minutes at 30 tokens/s). It's not an ideal estimate, as it only covers generation and ignores the overhead of prompt processing and of longer contexts in general.
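For anyone who wants to check the arithmetic, here's a quick sketch of how those estimates fall out of the measured speeds and wall power (the inputs are just the figures from the runs above; small differences against the numbers I quoted are rounding):

    # Energy estimate: energy = wall power * time, time = tokens / generation speed
    TOKENS = 64_000  # total tokens generated in the estimate

    runs = [
        ("46b Mixtral q4, ~75% in VRAM", 15, 300),   # (tokens/s, watts at the wall)
        ("46b Mixtral q2, 100% in VRAM", 30, 350),
        ("46b Mixtral q2, 0% in VRAM",    7, 250),
        ("7b Mistral q8, 100% in VRAM",  45, 300),
    ]

    for name, tps, watts in runs:
        hours = TOKENS / tps / 3600      # generation time in hours
        kwh = watts * hours / 1000       # watt-hours -> kWh
        print(f"{name}: {hours * 60:.0f} min, {kwh:.2f} kWh")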
The power usage essentially mirrors token generation speed, which shouldn't be too surprising: the more of the model you can fit into fast VRAM, the faster tokens generate and the less power you use for the same number of tokens. Also note that I'm using mid and low tier AMD cards, with the mid tier card doing the 7b test. If you have an Nvidia card with fast memory bandwidth (e.g., a 3090/4090), or an Apple ARM Ultra, you're going to see in the region of 60 tokens/s for the 7b model. With a mid range Nvidia card (any of the 4070s), or an Apple ARM Max, you can probably expect similar performance to mine on 7b models (45 t/s or so). Apple ARM probably wins purely on total power usage, but you're also going to pay an arm and a leg for a 64 GB model, which is the minimum you'd want for running medium/large models at reasonable quants (46b Mixtral at q6/q8, or 70b at q6), though with the rate models are advancing you may be able to get away with 32 GB (Mixtral at q4/q6, 34b at q6, 70b at q3).
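A rough way to sanity-check those expectations: generation is memory bound, so a common rule of thumb is tokens/s ≈ memory bandwidth / model size, since each token effectively streams the whole quantised model through memory. The bandwidth numbers below are approximate spec-sheet figures I'm assuming, and real throughput lands well below the bound once compute and overhead enter the picture:

    # Rule-of-thumb upper bound for memory-bound generation:
    #   tokens/s <= memory bandwidth / quantised model size
    model_gb = 7.2  # 7b Mistral at q8

    cards = [
        ("RTX 4070 (~500 GB/s)",          500),
        ("RTX 3090/4090 (~940-1000 GB/s)", 1000),
        ("Apple M2 Ultra (~800 GB/s)",     800),
    ]

    for name, gbs in cards:
        print(f"{name}: <= {gbs / model_gb:.0f} tokens/s")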
I'm not sure how many tokens a Copilot style interface churns through, but it's probably in the same ballpark. A reasonable high-end figure for either interface is about a kWh a day, and even in expensive regions like Europe that's probably no more than $15/mo. The actual cost comparison then gets a little complicated: spending $1500 on two 3090s for 48 GB of fast VRAM isn't going to make sense for most people, and making do with whatever cards you can get your hands on, so long as they have a reasonable amount of VRAM, probably isn't going to pay off in the long run either. It also depends on the size of model you want to use and how much quantisation you're willing to put up with: current 34b models or Mixtral at reasonable quants (q4 at least) should be comparable to ChatGPT 3.5; future local models may perform better (either in generation speed or in how smart they are), but ChatGPT 5 may blow everything we have now out of the water. It seems far too early to make purchasing decisions based on what may happen, but most people should be able to run 7b/13b and maybe up to 34/46b models with what they already have, without breaking the bank when it comes time to pay the power bill.
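And the power bill maths behind that $15/mo figure, assuming 1 kWh/day of inference and a deliberately high European price of around $0.40/kWh (both assumptions, adjust for your own usage and tariff):

    # Ballpark monthly running cost under the assumptions above
    kwh_per_day = 1.0    # assumed heavy daily inference
    usd_per_kwh = 0.40   # assumed high-end European rate

    print(f"~${kwh_per_day * usd_per_kwh * 30:.0f}/month")  # ~$12/month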