> The training token count is tripled (6T vs. Llama2's 2T)
Damn, 6T? That's a lot!
Given that this model seems to roughly match Mistral (according to the numbers from Google), this makes me think we've saturated what's achievable at the 7B scale, and it can't get much better unless new techniques are discovered.
Hard to say definitively. Mistral’s token embeddings only account for <2% of its 7B parameters, while Gemma’s much larger vocabulary vampirized over 10%, leaving less capacity for the more important parts of the network. It’s a somewhat surprising tradeoff given that the pretraining skews heavily towards English.
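A rough back-of-the-envelope check, assuming the publicly reported configs (Mistral-7B: 32k vocab, 4096 hidden dim, ~7.24B params; Gemma-7B: 256k vocab, 3072 hidden dim, ~8.54B params; total counts approximate):

```python
# Fraction of each model's parameters spent on the token embedding table
# (vocab_size * hidden_dim). Config values assumed from public model cards.
models = {
    # name: (vocab_size, hidden_dim, approx_total_params)
    "Mistral-7B": (32_000, 4096, 7.24e9),
    "Gemma-7B": (256_000, 3072, 8.54e9),
}

for name, (vocab, dim, total) in models.items():
    emb = vocab * dim  # parameters in the embedding matrix
    print(f"{name}: {emb/1e6:.0f}M embedding params "
          f"= {100 * emb / total:.1f}% of ~{total/1e9:.2f}B total")
```

That works out to roughly 131M (~1.8%) for Mistral vs. roughly 786M (~9.2%) for Gemma, which is where the <2% vs. ~10% figures come from.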