
I don't think speculative decoding proves anything about whether they consume less or more energy per question.

Regardless of whether the question/prompt is simple or not (for any definition of simple), if the target output is T tokens, the larger model still has to score at least T tokens; and whenever the small and large models disagree, the large model ends up processing more than T positions, because the rejected draft positions it scored are thrown away. The observed speedup comes from the large model being able to score K+1 tokens in a single parallel forward pass, based on the drafts of the smaller model, instead of generating them sequentially. But I would argue that the "important" computation is still done (also, the smaller model is called the same number of times regardless of the difficulty of the question, which brings us back to the same problem: LLMs won't vary their energy consumption dynamically as a function of question complexity).
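
To make the accounting concrete, here is a minimal greedy-acceptance sketch of the speculative decoding loop (draft_model and target_model are hypothetical stand-in callables, not a real API, and real implementations accept drafts via a probability-ratio test rather than exact match). The counters illustrate the point above: the large model scores at least as many positions as there are emitted tokens, no matter how simple the prompt is.

    def speculative_decode(prompt, draft_model, target_model, k=4, max_new_tokens=32):
        tokens = list(prompt)
        target_positions_scored = 0   # proxy for the large model's work
        draft_calls = 0               # proxy for the small model's work

        while len(tokens) - len(prompt) < max_new_tokens:
            # 1. Small model drafts k tokens, one at a time (cheap, sequential).
            ctx, draft = list(tokens), []
            for _ in range(k):
                t = draft_model(ctx)
                draft.append(t)
                ctx.append(t)
                draft_calls += 1

            # 2. Large model scores the k drafted positions plus one extra in a
            #    single parallel forward pass; it still touches every position,
            #    including drafts that end up being rejected.
            verified = target_model(tokens, draft)          # k + 1 predictions
            target_positions_scored += len(draft) + 1

            # 3. Accept the longest prefix on which both models agree, then take
            #    the large model's own token at the first disagreement (or at the
            #    bonus position if everything was accepted).
            n_accept = 0
            while n_accept < len(draft) and draft[n_accept] == verified[n_accept]:
                n_accept += 1
            tokens.extend(draft[:n_accept])
            tokens.append(verified[n_accept])

        return tokens, target_positions_scored, draft_calls


    # Toy stand-ins that always agree (both just "predict" the next integer),
    # so every draft is accepted; with real models, disagreements only increase
    # target_positions_scored relative to the number of tokens actually emitted.
    draft_model = lambda ctx: ctx[-1] + 1
    target_model = lambda ctx, draft: [ctx[-1] + 1 + i for i in range(len(draft) + 1)]

    out, big_work, small_work = speculative_decode([0], draft_model, target_model)
    print(len(out) - 1, big_work, small_work)   # 35 tokens emitted, 35 scored, 28 drafts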

Also, the rate of disagreement does not necessarily increase when the question is more complex; the two models may simply have learned different things and could disagree even on a "simple" question.



