No LLM struggles with two-digit arithmetic. 100-digit addition is possible with state-of-the-art position encodings. Counting is not bottlenecked by arithmetic at all.
When you ask an LLM to count the number of "r"s in the word "Strawberry", it will output an essentially random number. If you ask it to first separate the letters into "S t r a w b e r r y", then each letter is tokenized independently and the attention mechanism is capable of performing the task.
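A quick way to see the difference is to run both strings through a tokenizer. This is just a sketch, assuming the tiktoken library with its "cl100k_base" encoding as a stand-in for whatever tokenizer a given LLM actually uses:

```python
# Sketch: compare how the two prompts split into tokens (assumes tiktoken).
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

for text in ["Strawberry", "S t r a w b e r r y"]:
    ids = enc.encode(text)
    pieces = [enc.decode([i]) for i in ids]
    print(f"{text!r} -> {pieces}")

# The exact splits depend on the tokenizer, but the plain word typically comes
# out as one or two multi-character chunks, while the spaced-out version comes
# out as one token per letter, which attention can count directly.
```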
What you are doing is essentially denying that the problem exists.
"How many letters "r" are in the word Frurirpoprar"
And it didn't use code execution (at least it didn't show the icon, and the answer came back very fast, so it's unlikely it generated and executed a program to count).
I wouldn't count on that working in general. That word may tokenize to roughly one token per character and the model may have seen relevant data, or it may be close enough to some other word that the model has seen data which gives away the answer.
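You can check that caveat directly by looking at how the made-up word actually splits. Again a sketch only, assuming tiktoken's "cl100k_base" encoding:

```python
# Sketch: inspect how the made-up word tokenizes (assumes tiktoken).
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
word = "Frurirpoprar"
pieces = [enc.decode([i]) for i in enc.encode(word)]
print(pieces)

# If the pieces are mostly one or two characters each, the model effectively
# "sees" the letters, and counting 'r's is a much easier job for attention
# than it is for a common word packed into one or two opaque tokens.
```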
Why would you confidently state such a falsehood? It's exactly the opposite: it's mostly due to tokenization. Show me NeurIPS papers giving evidence of the opposite, because I can square up with NeurIPS papers to substantiate that it is tokenization that causes these issues.
If you use standard BPE, you likely won't tokenize every number by its digits; it depends on the dataset used to train the tokenizer.
The point is, you have a choice: you can do the tokenization however you like. The reason 23 is interesting is that there's a case to be made that a model is more likely to learn that 23 relates to Jordan if it's a single token, and less likely if it's two tokens. The opposite is true for math problems.
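To make the trade-off concrete, here's a sketch of how some numbers split under one common tokenizer. This assumes tiktoken's "cl100k_base" encoding; other tokenizers make different choices (some force every digit into its own token):

```python
# Sketch: see how numbers split under a standard BPE vocabulary (assumes tiktoken).
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
for text in ["23", "Michael Jordan wore 23", "1234567890"]:
    pieces = [enc.decode([i]) for i in enc.encode(text)]
    print(f"{text!r} -> {pieces}")

# A single "23" token makes the Jordan association easy to pick up but hides
# the digits; digit-by-digit tokens expose the digits for arithmetic but
# spread the "23" concept across multiple tokens.
```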
The reality is whatever we want to make it. It's likely that current schemes are... suboptimal. In practice it would be great if every token were geometrically well spaced after embedding and preserved semantic information, among other things. The "other things" have taken precedence thus far.
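"Geometrically well spaced" can be made concrete by looking at pairwise cosine similarities over the embedding matrix. This is a toy sketch using a random matrix as a stand-in; a real check would load a trained model's token embeddings instead:

```python
# Toy sketch: measure how "spread out" an embedding matrix is via pairwise
# cosine similarity. The random matrix below is a stand-in, not a real model.
import numpy as np

rng = np.random.default_rng(0)
vocab_size, dim = 1000, 64
E = rng.standard_normal((vocab_size, dim))            # stand-in embedding matrix

E_norm = E / np.linalg.norm(E, axis=1, keepdims=True)
cos = E_norm @ E_norm.T                               # pairwise cosine similarity
off_diag = cos[~np.eye(vocab_size, dtype=bool)]
print(f"mean |cos| = {np.abs(off_diag).mean():.3f}, max |cos| = {np.abs(off_diag).max():.3f}")

# "Well spaced" means these off-diagonal similarities stay small, while
# semantically related tokens (e.g. '23' and 'Jordan') should still sit closer
# together than unrelated ones -- two goals that can pull in opposite directions.
```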