Just dealing with numbers and numeric comparisons improves a lot with model size. Symbols are not a great encoding of numbers, and models have struggled with math.
ChatGPT3.5 handles that just fine.
I do like the big improvement from ChatGPT3.5 to ChatGPT4 on answers to questions like "Which is heavier, two pounds of bricks or one pound of feathers?" 3.5 is really inclined to say "They are both the same weight, as they both weigh one pound."
Is a larger model better with numbers simply because it is more likely to have seen examples that use those same numbers? Or because the extra scale somehow gives it a better ability to reason about numbers?
Right now, larger models have more complicated and rich structures encoding information about numbers and the meanings of their parts.
There's a fundamental awkwardness that comes with doing math using a device that only seeks to predict the "next token" coming out, and that only understands numbers as a sequence of tokens (usually digits in base 10). It doesn't even start with knowledge of the ordering of the digits: that just comes from the examples it has seen.
Either it must:
- "Think ahead" inside the linear algebra of the model, so that it has already carried all the digits, etc. There are no real "steps" in this operation that are akin to the things we think about when we do arithmetic.
- Describe what it is doing, so that the intermediate work is inside its context buffer.
Right now, the models have learned structures that reliably think 3-4 digits ahead in most cases, which is way better than before but still pretty bad compared to a competent 4th grader taking their time with arithmetic. But if you create a scenario where the model describes its reasoning, it can do pretty well.
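Here's a minimal sketch of that second approach for comparison, assuming the openai Python client (v1+), an API key in the environment, and gpt-3.5-turbo; the specific sum and the prompt wording are just illustrative.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Ask the same sum twice: once demanding only the answer (forcing the model to
# "think ahead" internally), and once asking it to describe the steps so the
# intermediate digits land in its own context window.
question = "What is 48132 + 9879?"
styles = [
    "Answer with only the final number.",
    "Work through the addition column by column, then give the final number.",
]

for style in styles:
    reply = client.chat.completions.create(
        model="gpt-3.5-turbo",  # assumption: any chat model will do here
        messages=[{"role": "user", "content": f"{question} {style}"}],
    )
    print(style, "->", reply.choices[0].message.content)
```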
> that only understands numbers as a sequence of tokens (usually digits in base 10).
You wish!
A base-10 representation would make it much easier for the model, but the current tokenization merges digits according to their frequency, so (at least for GPT-3.5) 50100 gets tokenized as "501"/"00" while 50200 gets tokenized as "50"/"200", which makes it tricky to compare them or do math with them. Also, if you ask it "How many zeroes does 50100 contain?", the relationship between "501" and "0" has to be learned purely from the training data, because after tokenization the model only gets the ID of the token representing "501", which carries no information about its composition.
We use Arabic numerals because their positional encoding makes arithmetic easier, but language models receive the same data without that positional structure; what they're given is more like an extreme version of Roman numerals.
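If you want to see the splits yourself, here's a minimal sketch using the tiktoken library; which encoding a given model actually uses is an assumption here, and the splits differ between the older GPT-3-era BPE ("p50k_base") and the newer "cl100k_base".

```python
# Inspect how two GPT encodings split the numbers discussed above.
# pip install tiktoken
import tiktoken

for name in ["p50k_base", "cl100k_base"]:
    enc = tiktoken.get_encoding(name)
    for number in ["50100", "50200"]:
        pieces = [enc.decode([tok]) for tok in enc.encode(number)]
        print(f"{name}: {number} -> {pieces}")
```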
> but the current tokenization merges digits according to their frequency
Haha, that's even worse. I've not looked at the tokenization in depth; I just assumed digits were individual symbols. Thank you for the correction.
Any idea why this tokenization was used for digits? I understand that being blind to the input content and just learning a tokenization through frequency analysis has its merits for language, but the whole number thing seems awful. Any benefit in density (fitting more into the context window) seems worthless given how much harder it makes understanding what the numbers mean.
The simple answer is that the same subword tokenization algorithm is used for everything: all symbols of all languages in all alphabets, all domains (books, tweets, code, etc.), and all other symbols like emoji (including combined characters) and punctuation. If you were optimizing for digit-specific tasks it would make sense to treat digits specially, but the current widely used models don't seem to do that, at least GPT up to GPT-3.5 doesn't - you can try it out here: https://platform.openai.com/tokenizer . And it kind of makes sense, because in the actual usage seen in training data, digits IMHO are mostly not representing decimal integers for math; they show up as phone numbers, as components of identifiers like "GPT-3", as parts of mail addresses, things like that, which are more common in textual data than math.
I dunno. Sometimes a group of digits has a non-mathematical semantic meaning that makes a multi-digit token a good fit, like an area code or '777'. A lot of the rest of the time it's pretty random. A tokenizer's job is to lower the size of the input for a given amount of meaning without obscuring the real underlying relationships too much, and here it feels like it doesn't meet that goal.
My phone number is 6 tokens instead of 12 symbols... so this is only going to make a moderate difference on things like big lists of phone numbers.
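If anyone wants to sanity-check counts like that, here's a quick sketch with tiktoken; the phone number below is made up, and the count varies with the encoding and with separators.

```python
# Count tokens for a few digit-heavy strings under one encoding.
# pip install tiktoken
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

for text in ["4155550123",       # hypothetical 10-digit phone number
             "+1 415-555-0123",  # same number with separators
             "GPT-3"]:           # digits as part of an identifier
    tokens = enc.encode(text)
    pieces = [enc.decode([t]) for t in tokens]
    print(f"{text!r}: {len(text)} chars -> {len(tokens)} tokens {pieces}")
```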
The larger model doesn't have a notably larger dataset, to my understanding. It just has more parameters, so it learns higher-order abstractions about the dataset.