This is indeed a bit of a dark art. Essentially, you want a balance between "is significantly faster than the base model" and "generates similar stuff to the base model".
Anecdotally, folks often seem to use, say, a 70B base model with a 7B draft model (the big model acts as the verifier). But I think there's a lot of room for experimentation and improvement here.
You could, say, take a 70B model, chop off the last 90% of the layers, and fine-tune what's left. Or perhaps use a model that's trained to generate 8 tokens at once. Or you could just use a statistical n-gram predictor, as sketched below.
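If you want to play with that last idea, here's a minimal sketch of an n-gram drafter (the "prompt lookup" trick): match the last few tokens against an earlier occurrence in the context and speculate that the continuation repeats. The function name and parameters are mine, purely for illustration:

```python
def ngram_draft(tokens, n=3, num_draft=8):
    """Propose draft tokens by matching the last (n-1) tokens
    against an earlier occurrence in the context (prompt-lookup style).

    tokens: token ids seen so far
    n: n-gram order used for matching
    num_draft: how many tokens to speculate
    """
    if len(tokens) < n:
        return []
    suffix = tuple(tokens[-(n - 1):])
    # Scan backwards for the most recent earlier occurrence of the suffix.
    for i in range(len(tokens) - n, -1, -1):
        if tuple(tokens[i:i + n - 1]) == suffix:
            start = i + n - 1
            return tokens[start:start + num_draft]
    return []  # no match -> fall back to normal decoding

# e.g. the context ends in (9, 2), which also appeared earlier:
context = [5, 9, 2, 7, 3, 9, 2]
print(ngram_draft(context, n=3, num_draft=4))  # -> [7, 3, 9, 2]
```

The base model then scores all the drafted tokens in a single forward pass and keeps the longest prefix it agrees with, so a bad draft costs you almost nothing.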