No LLM struggles with two-digit arithmetic. 100-digit addition is possible with state-of-the-art position encodings. Counting is not bottlenecked by arithmetic at all.
When you ask an LLM to count the number of "r"s in the word "Strawberry", it will output an essentially random number. If you ask it to first separate the letters into "S t r a w b e r r y", then each letter is tokenized independently and the attention mechanism is capable of performing the task.
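A quick way to see the difference is to run both strings through a tokenizer. This is just a sketch, assuming the tiktoken library with its "cl100k_base" encoding as a stand-in for whatever tokenizer a given LLM actually uses:

```python
# Sketch: compare how the two prompts split into tokens (assumes tiktoken).
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

for text in ["Strawberry", "S t r a w b e r r y"]:
    ids = enc.encode(text)
    pieces = [enc.decode([i]) for i in ids]
    print(f"{text!r} -> {pieces}")

# The exact splits depend on the tokenizer, but the plain word typically comes
# out as one or two multi-character chunks, while the spaced-out version comes
# out as one token per letter, which attention can count directly.
```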
What you are doing is essentially denying that the problem exists.
"How many letters "r" are in the word Frurirpoprar"
And it didn't use code execution (at least it didn't show the icon, and the answer came back very fast, so it's unlikely it generated and executed a program to count).
I wouldn't count on that working in general. That word may tokenize to roughly one token per character and the model may have seen relevant data, or it may be close enough to some other word that the model has seen data which gives away the answer.
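You can check that caveat directly by looking at how the made-up word actually splits. Again a sketch only, assuming tiktoken's "cl100k_base" encoding:

```python
# Sketch: inspect how the made-up word tokenizes (assumes tiktoken).
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
word = "Frurirpoprar"
pieces = [enc.decode([i]) for i in enc.encode(word)]
print(pieces)

# If the pieces are mostly one or two characters each, the model effectively
# "sees" the letters, and counting 'r's is a much easier job for attention
# than it is for a common word packed into one or two opaque tokens.
```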
Why would you confidently state such a falsehood? It's exactly the opposite: it's mostly due to tokenization. Show me NeurIPS papers giving evidence of the opposite, because I can square up with NeurIPS papers to substantiate that it is tokenization that causes these issues.
If you use standard BPE, you likely won't tokenize every number by its digits; it depends on the dataset used to train the tokenizer.
The point is, you have a choice: you can do the tokenization however you like. The reason 23 is interesting is that there's a case to be made that a model is more likely to learn that 23 relates to Jordan if it's a single token, and less likely if it's two tokens. The opposite is true for math problems.
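To make the trade-off concrete, here's a sketch of how some numbers split under one common tokenizer. This assumes tiktoken's "cl100k_base" encoding; other tokenizers make different choices (some force every digit into its own token):

```python
# Sketch: see how numbers split under a standard BPE vocabulary (assumes tiktoken).
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
for text in ["23", "Michael Jordan wore 23", "1234567890"]:
    pieces = [enc.decode([i]) for i in enc.encode(text)]
    print(f"{text!r} -> {pieces}")

# A single "23" token makes the Jordan association easy to pick up but hides
# the digits; digit-by-digit tokens expose the digits for arithmetic but
# spread the "23" concept across multiple tokens.
```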
The reality is whatever we want to make it. It's likely that current schemes are... suboptimal. In practice it would be great if every token were geometrically well spaced after embedding and preserved semantic information, among other things. The "other things" have taken precedence thus far.
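"Geometrically well spaced" can be made concrete by looking at pairwise cosine similarities over the embedding matrix. This is a toy sketch using a random matrix as a stand-in; a real check would load a trained model's token embeddings instead:

```python
# Toy sketch: measure how "spread out" an embedding matrix is via pairwise
# cosine similarity. The random matrix below is a stand-in, not a real model.
import numpy as np

rng = np.random.default_rng(0)
vocab_size, dim = 1000, 64
E = rng.standard_normal((vocab_size, dim))            # stand-in embedding matrix

E_norm = E / np.linalg.norm(E, axis=1, keepdims=True)
cos = E_norm @ E_norm.T                               # pairwise cosine similarity
off_diag = cos[~np.eye(vocab_size, dtype=bool)]
print(f"mean |cos| = {np.abs(off_diag).mean():.3f}, max |cos| = {np.abs(off_diag).max():.3f}")

# "Well spaced" means these off-diagonal similarities stay small, while
# semantically related tokens (e.g. '23' and 'Jordan') should still sit closer
# together than unrelated ones -- two goals that can pull in opposite directions.
```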