
> The training token count is tripled (6T vs. Llama2's 2T)

Damn, 6T? That's a lot!

Given that this model seems to roughly match Mistral (according to Google's own numbers), this makes me think we've saturated the 7B parameter space and can't make it much better unless new techniques are discovered.

Hard to say definitively. Mistral's token embeddings account for less than 2% of its 7B parameters, while Gemma's much larger token vocabulary eats up over 10%, leaving less room for the more important parts of the network. It's a somewhat surprising tradeoff given that the pretraining data skews toward English.
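
A quick back-of-envelope sketch of those percentages, assuming the publicly reported configs (Mistral 7B: ~32k vocab, 4096 hidden dim; Gemma 7B: ~256k vocab, 3072 hidden dim) and nominal "7B" totals — the figures are approximations, not official numbers, and Gemma's true parameter count is somewhat above 7B:

    # Rough estimate of the fraction of a model's parameter budget
    # spent on the token embedding matrix (vocab_size x hidden_dim).
    # Configs and totals are approximate, taken from public model cards.

    def embedding_share(vocab_size, hidden_dim, total_params):
        """Fraction of total parameters used by token embeddings."""
        return vocab_size * hidden_dim / total_params

    # Mistral 7B: 32k vocab, 4096 hidden dim, ~7.2B total params
    print(f"Mistral: {embedding_share(32_000, 4096, 7.2e9):.1%}")   # ~1.8%

    # Gemma 7B: ~256k vocab, 3072 hidden dim, against the nominal 7B figure
    print(f"Gemma:   {embedding_share(256_000, 3072, 7.0e9):.1%}")  # ~11%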
