> The training token count is tripled (6T vs. Llama2's 2T)
Damn, 6T? That's a lot!
Given that this model seems to roughly match Mistral (according to the numbers from Google), this makes me think we've saturated what's achievable at the 7B scale, and it can't get much better unless new techniques are discovered.
Hard to say definitively. Mistral’s token embeddings only account for <2% of its 7B parameters, while Gemma’s much larger vocabulary vampirized over 10%, leaving less capacity for the more important parts of the network. It’s a somewhat surprising tradeoff given that the pretraining skews heavily towards English.
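A rough back-of-the-envelope check, assuming the publicly reported configs (Mistral-7B: 32k vocab, 4096 hidden dim, ~7.24B params; Gemma-7B: 256k vocab, 3072 hidden dim, ~8.54B params; total counts approximate):

```python
# Fraction of each model's parameters spent on the token embedding table
# (vocab_size * hidden_dim). Config values assumed from public model cards.
models = {
    # name: (vocab_size, hidden_dim, approx_total_params)
    "Mistral-7B": (32_000, 4096, 7.24e9),
    "Gemma-7B": (256_000, 3072, 8.54e9),
}

for name, (vocab, dim, total) in models.items():
    emb = vocab * dim  # parameters in the embedding matrix
    print(f"{name}: {emb/1e6:.0f}M embedding params "
          f"= {100 * emb / total:.1f}% of ~{total/1e9:.2f}B total")
```

That works out to roughly 131M (~1.8%) for Mistral vs. roughly 786M (~9.2%) for Gemma, which is where the <2% vs. ~10% figures come from.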