There are plenty of results supporting my assertion, but the tests must be carefully designed. LLMs are not databases that store exact answers, so it's not enough to ask about something the model hasn't seen verbatim if it has seen something similar (as is likely the case with your programming language).
One benchmark that I track closely is ConceptARC, which aims to test generalization and abstraction capabilities.
Here is a very recent result that uses the benchmark: https://arxiv.org/abs/2311.09247. Humans correctly solved 91% of the problems, GPT-4 solved 33%, and GPT-4V did much worse than GPT-4.
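For context on what these problems look like: ConceptARC tasks follow the original ARC format, where each task gives a few input/output grid pairs demonstrating a concept, plus test inputs the solver must complete. Here's a minimal Python sketch of that structure; the specific grids below are invented for illustration (a hypothetical "move the pixel to the bottom row" concept), not taken from the actual benchmark:

```python
import json

# ARC/ConceptARC-style task: grids are small 2-D arrays of integers 0-9
# (colors). "train" pairs demonstrate the concept; "test" inputs must be
# solved by inferring and applying that same concept.
task = {
    "train": [
        {"input": [[1, 0], [0, 0]], "output": [[0, 0], [1, 0]]},
        {"input": [[0, 2], [0, 0]], "output": [[0, 0], [0, 2]]},
    ],
    "test": [
        # Solver must infer the rule: the colored pixel moves to the bottom row.
        {"input": [[3, 0], [0, 0]]},
    ],
}
print(json.dumps(task, indent=2))
```

The point of the benchmark is that each task instantiates an abstract concept in a novel way, so memorized patterns don't help; the solver has to generalize from a handful of examples.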