
Is this true?

>> Do not panic! A lot of the large LLM vocabularies are pretty huge (30k-300k tokens large)

Seems small by an order of magnitude (at least). English alone has 1+ million words




Most of these 1+ million words are almost never used, so 200k is plenty for English. Ideally, rarer words would be longer and to some degree compositional (optim-ism, optim-istic, etc.), but unfortunately that is not what tokenisers arrive at (you are more likely to get "opt-i-mis-m" or something like that). People have tried to optimise the tokeniser jointly with the main LLM training, which leads to more sensible splits, but this is unworkable for larger models, so we are stuck with inflated basic vocabularies.
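For example, here is a quick sketch with OpenAI's tiktoken library (the exact splits depend on which vocabulary you load, so treat the output as illustrative):

    import tiktoken  # pip install tiktoken

    enc = tiktoken.get_encoding("cl100k_base")  # ~100k-token vocabulary

    for word in ["optimism", "optimistic", "optimiser"]:
        ids = enc.encode(word)
        pieces = [enc.decode_single_token_bytes(t).decode("utf-8", errors="replace")
                  for t in ids]
        print(word, "->", pieces)

    # Common words tend to be single tokens; rarer ones break into whatever
    # fragments happened to be frequent in the tokeniser's training data,
    # which rarely lines up with morphology.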

It is also probably possible now to go for even larger vocabularies, in the 1-2 million range (by factorising the embedding matrix, for example), but this does not lead to noticeable improvements in performance, AFAIK.
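Roughly, the factorisation trick looks like this (a PyTorch sketch with made-up sizes, not taken from any particular model):

    import torch
    import torch.nn as nn

    vocab_size = 1_000_000   # hypothetical 1M-token vocabulary
    rank = 128               # small intermediate dimension
    d_model = 4096           # model hidden size

    # Full embedding table: vocab_size * d_model ~= 4.1B parameters.
    # Factorised version: vocab_size * rank + rank * d_model ~= 0.13B.
    factored_embedding = nn.Sequential(
        nn.Embedding(vocab_size, rank),
        nn.Linear(rank, d_model, bias=False),
    )

    token_ids = torch.randint(0, vocab_size, (2, 16))  # batch of 2, length 16
    hidden = factored_embedding(token_ids)             # shape (2, 16, 4096)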


Performance would be massively improved on constrained text tasks. That alone makes it worth it to expand the vocabulary size.


Tokens are often sub-word, all the way down to raw bytes (which are implicitly understood as UTF-8, but models will sometimes generate invalid UTF-8...).
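For instance, a byte-level token boundary can fall in the middle of a multi-byte character, so a partial output simply isn't valid UTF-8 (plain Python, no tokenizer needed):

    text = "café"
    data = text.encode("utf-8")   # b'caf\xc3\xa9' -- the 'é' is two bytes
    partial = data[:-1]           # cut off mid-character, as a model emitting
                                  # one byte-level token at a time might do

    try:
        partial.decode("utf-8")
    except UnicodeDecodeError as err:
        print("invalid UTF-8:", err)

    # In practice you decode leniently and wait for more bytes:
    print(partial.decode("utf-8", errors="replace"))  # 'caf\ufffd'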


BPE is complete. Every valid Unicode string can be encoded with any byte-level BPE tokenizer, because in the worst case it falls back to individual byte tokens.

Byte-level BPE basically starts with a token for each of the 256 possible byte values and then creates new tokens by merging frequently co-occurring pairs ('t' followed by 'h' becomes a new token 'th').
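A toy version of that training loop, just to show the idea (not an efficient or faithful reimplementation of any real tokenizer): since the base vocabulary already covers all 256 byte values, any string can be encoded before a single merge is learned, and the merges only make the encoding shorter.

    from collections import Counter

    def learn_bpe(corpus: bytes, num_merges: int):
        """Start from the 256 byte values and repeatedly merge the most
        frequent adjacent pair of tokens into a new, longer token."""
        seq = [bytes([b]) for b in corpus]   # one token per byte initially
        merges = []
        for _ in range(num_merges):
            pairs = Counter(zip(seq, seq[1:]))
            if not pairs:
                break
            (a, b), _count = pairs.most_common(1)[0]
            merges.append(a + b)
            # Re-tokenise the sequence with the new merge applied greedily.
            out, i = [], 0
            while i < len(seq):
                if i + 1 < len(seq) and seq[i] == a and seq[i + 1] == b:
                    out.append(a + b)
                    i += 2
                else:
                    out.append(seq[i])
                    i += 1
            seq = out
        return merges

    print(learn_bpe("the theory of the thing".encode("utf-8"), 5))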



