It's actually fairly plausible. The answer is numeric. Two digits, even, which is pretty likely when adding together 2-digit inputs. 24 is also a common answer to math problems (it has lots of factors, for one). It even has exactly the digits you'd get from adding 1+3 and 1+1, just in the wrong order.
Now how plausible is
Show your work. 11 + 31 = the result of adding the 10s digits together, so 10 + 30 = 40, and then adding in the 1s digits, so 1 + 1 = 2. Combining the 40 and the 2 gives 24.
That last sentence doesn't seem very likely. Or:
Show your work. 11 + 31 = the result of adding the 10s digits together, so 10 + 30 = 20, and then adding in the 1s digits, so 1 + 1 = 4. Combining the 20 and the 4 gives 24.
If you're breaking things down, you have to traverse through some territory that is lower probability than the quick wrong answer.
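For contrast, here's the decomposition done correctly, as a small Python sketch (the function name is just for illustration):

```python
# Hypothetical sketch of the digit-wise "show your work" decomposition.
def add_by_digits(a, b):
    tens = (a // 10) * 10 + (b // 10) * 10  # 10 + 30 = 40
    ones = (a % 10) + (b % 10)              # 1 + 1 = 2
    return tens + ones                      # combining 40 and 2 gives 42

print(add_by_digits(11, 31))  # → 42, not 24
```

Every intermediate step here is high-probability given the ones before it, which is exactly why the wrong-answer completions above have to pass through at least one unlikely sentence.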
The argument by computational complexity is stronger, though. I just wanted to point out that the above is a confounding explanation that is sufficient for simple cases, and so may need to be ruled out before claiming that computational complexity matters.
The complexity argument is also intuitively obvious. If you think of an LLM as a type of computer that does one constant-time forward pass over the input so far on each clock cycle (and outputs a single token), then of course you can compute more if you give your computer more cycles! You can use state (even if the mechanism for transmitting the state from one cycle to the next is sharply limited).
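As a toy model of that picture (entirely illustrative, not how any real LLM works): each "tick" does a constant amount of work, and state only crosses ticks through what gets emitted, here one output digit plus a carry.

```python
# Toy model: each "tick" does one bounded-work step; state passes to the
# next tick only via what is emitted (an output digit and a carry).
def add_one_tick(d1, d2, carry):
    s = d1 + d2 + carry          # constant work per tick
    return s % 10, s // 10       # emitted digit, carried state

def add_over_ticks(a_digits, b_digits):
    # Digits are least-significant first, e.g. 11 -> [1, 1].
    out, carry = [], 0
    for d1, d2 in zip(a_digits, b_digits):
        digit, carry = add_one_tick(d1, d2, carry)
        out.append(digit)
    if carry:
        out.append(carry)
    return out

print(add_over_ticks([1, 1], [1, 3]))  # 11 + 31 -> [2, 4], i.e. 42
```

The per-tick work is fixed, but longer inputs just take more ticks, which is the whole point: more cycles buy you more total computation.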
Similarly, it's an expansion of the old problem of a single-layer perceptron not being able to compute XOR. (Here, the "cycles" are advances from one layer to the next.)
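The XOR point can be checked directly: a single threshold unit can't fit XOR for any weights (it isn't linearly separable), while one extra layer — one extra "cycle" — fixes it. A quick sketch, brute-forcing a weight grid for the single-layer case and hand-wiring the two-layer one:

```python
import itertools

XOR = {(0, 0): 0, (0, 1): 1, (1, 0): 1, (1, 1): 0}

def unit(w1, w2, b, x1, x2):
    # A single linear threshold unit.
    return 1 if w1 * x1 + w2 * x2 + b > 0 else 0

# One layer: try a grid of weights; no setting matches XOR on all 4 inputs.
grid = [x / 2 for x in range(-6, 7)]  # -3.0 .. 3.0 in steps of 0.5
single_layer_ok = any(
    all(unit(w1, w2, b, *x) == y for x, y in XOR.items())
    for w1, w2, b in itertools.product(grid, repeat=3)
)
print(single_layer_ok)  # False: XOR is not linearly separable

# Two layers: OR and NAND units feed an AND unit.
def two_layer(x1, x2):
    h_or = unit(1, 1, -0.5, x1, x2)
    h_nand = unit(-1, -1, 1.5, x1, x2)
    return unit(1, 1, -1.5, h_or, h_nand)

print(all(two_layer(*x) == y for x, y in XOR.items()))  # True
```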
That's not to say that the nuances are obvious. Simply saying you can use multiple clock ticks doesn't really say anything about how much you can do in one tick.