Understanding and assimilation can lead to generating relations between disjoint sets of tokens.
For example, "squeeze a tube to cut water flow" and "put pressure on a deep wound to stop blood loss" can only be related, if not already in the training data, if there is understanding and intelligence.
The ability to do that is intelligence.
Otherwise, it's just a search and optimization problem.
Firstly, transformer architectures are not "just a search and optimisation problem". They do generalise. Whether that generalisation is sufficient to be structurally equivalent to what we consider intelligence is an open question, but getting them to demonstrate that they generalise is easy (e.g. ChatGPT can do math, albeit badly, with numbers large enough that it is infeasible for those exact calculations to have simply occurred in its training set).
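To put a rough number on the "infeasible" part, here is a back-of-the-envelope counting sketch. The corpus size is an assumed order of magnitude, not a published figure for any model, and the function names are made up for illustration. It only bounds how many distinct operand pairs a corpus could contain; it says nothing about how well any model actually does the arithmetic.

```python
# Back-of-the-envelope counting sketch; not a measurement of any real model.

# ASSUMPTION: a generous, hypothetical order of magnitude for training tokens.
ASSUMED_CORPUS_TOKENS = 10 ** 13


def count_operand_pairs(digits: int) -> int:
    """Number of ordered pairs of positive integers with exactly `digits` digits."""
    per_operand = 9 * 10 ** (digits - 1)  # e.g. 90 two-digit numbers (10..99)
    return per_operand ** 2


def max_fraction_seen(digits: int, corpus_tokens: int = ASSUMED_CORPUS_TOKENS) -> float:
    """Upper bound on the fraction of pairs a corpus could contain,
    pretending every single token were a distinct arithmetic problem."""
    return min(1.0, corpus_tokens / count_operand_pairs(digits))


if __name__ == "__main__":
    for d in (2, 5, 10, 20):
        print(f"{d:>2} digits: {count_operand_pairs(d):.3e} pairs, "
              f"at most {max_fraction_seen(d):.3e} of them in the corpus")
```

Under that assumption, small operands could all have been seen verbatim, but by 20 digits the corpus could cover only a vanishing fraction of the possible problems, so any correct answers there can't be pure lookup.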
Secondly, this poses the problem of 1) finding examples like the one you gave that it can't understand (regarding your specific example, see [1]); 2) ruling out that there was something "too close" in the training data that already draws the equivalence; and 3) showing that any failure to draw the equivalence is structural (it can't have real understanding) rather than qualitative (it has real understanding, but just isn't smart enough to understand the specific problem given).
So I'm back to my original question of how we would know if these are structurally different things in the first place.
[1] vidarh: How does squeezing a tube to cut water flow give you a hint as to what to do about a deep wound?
ChatGPT (GPT4): Squeezing a tube to cut off water flow demonstrates the basic principle of applying pressure to restrict or stop the flow of a fluid. Similarly, applying pressure to a deep wound can help control bleeding, which is a critical first aid measure when dealing with serious injuries.
[followed by a long list of what to do if encountering a deep wound]