It's interesting that so many of the models fail to retrieve this, but any that do solve it should clearly be able to do so with no reasoning/theory of mind.
I agree this is not a great test. What's good about it is that it is a constraint satisfaction problem, and I would expect LLMs to be pretty bad at unknown problems of this kind. The simple reason is that an LLM has only a finite number of layers, so it cannot perform arbitrarily long searches.
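To make the "arbitrarily long search" point concrete, here's a toy backtracking solver for n-queens, standing in for a generic constraint satisfaction problem (the example is mine, not from the thread). The number of nodes the search explores grows with n and has no fixed bound, whereas a forward pass through a fixed number of layers does a fixed amount of computation per token.

```python
# Toy backtracking search for n-queens, a classic constraint
# satisfaction problem. The work done grows with the instance size,
# with no fixed upper bound.

def solve_n_queens(n):
    """Place n queens, one per row; return (solution, nodes explored)."""
    nodes = 0

    def backtrack(cols):
        nonlocal nodes
        nodes += 1
        row = len(cols)
        if row == n:
            return list(cols)
        for col in range(n):
            # No shared column or diagonal with any earlier row.
            if all(col != c and abs(col - c) != row - r
                   for r, c in enumerate(cols)):
                cols.append(col)
                found = backtrack(cols)
                if found is not None:
                    return found
                cols.pop()  # dead end: undo and try the next column
        return None

    return backtrack([]), nodes

for n in (6, 8, 10, 12):
    _, nodes = solve_n_queens(n)
    print(f"n={n}: {nodes} nodes explored")
```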
I tried to get ChatGPT to write a Python program that creates a monthly work schedule (for imaginary workers) based on specific constraints (e.g. 10 workers, 2 shifts (morning and night), 40 working hours per week, at least one weekend per month off, a minimum of 2 workers per shift, no more than 3 consecutive working days, and so forth).
I am not sure whether I could have made it give me a working solution; I have not tried Claude, for example, nor other programming languages. Maybe.
The issue was that it messed up the constraints, so there were no feasible solutions. That said, it did give me a working program for a version with fewer constraints.
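For comparison, here is roughly what code for this problem could look like using Google's OR-Tools CP-SAT solver. This is a minimal sketch under assumptions the comment doesn't pin down: 8-hour shifts (so 40 h/week means exactly 5 shifts), a 28-day month starting on a Monday, and double shifts allowed. That last assumption matters: if a worker may take at most one shift per day, a fully free weekend forces five consecutive weekdays of work, which contradicts the 3-consecutive-day limit and makes the model infeasible. That is exactly the kind of over-constraining described above.

```python
# Sketch of the schedule as a constraint program (OR-Tools CP-SAT).
# Assumptions (mine, not from the original attempt): 8 h shifts,
# 28-day month starting Monday, double shifts permitted.
from ortools.sat.python import cp_model

WORKERS, DAYS, SHIFTS = 10, 28, 2          # shifts: 0 = morning, 1 = night
WEEKS = DAYS // 7
model = cp_model.CpModel()

# works[w][d][s] == 1 iff worker w takes shift s on day d.
works = [[[model.NewBoolVar(f"w{w}_d{d}_s{s}") for s in range(SHIFTS)]
          for d in range(DAYS)] for w in range(WORKERS)]

# works_day[w][d] == 1 iff worker w works at all on day d.
works_day = [[model.NewBoolVar(f"w{w}_d{d}") for d in range(DAYS)]
             for w in range(WORKERS)]
for w in range(WORKERS):
    for d in range(DAYS):
        model.AddMaxEquality(works_day[w][d], works[w][d])

# At least 2 workers on every shift.
for d in range(DAYS):
    for s in range(SHIFTS):
        model.Add(sum(works[w][d][s] for w in range(WORKERS)) >= 2)

# Exactly 40 h per week, i.e. 5 shifts (assumption: 8 h per shift).
for w in range(WORKERS):
    for k in range(WEEKS):
        week = range(7 * k, 7 * k + 7)
        model.Add(sum(works[w][d][s] for d in week for s in range(SHIFTS)) == 5)

# No more than 3 consecutive working days: every 4-day window has an off day.
for w in range(WORKERS):
    for d in range(DAYS - 3):
        model.Add(sum(works_day[w][d + i] for i in range(4)) <= 3)

# At least one full weekend (Sat+Sun, assuming day 0 is a Monday) off.
for w in range(WORKERS):
    free_weekends = []
    for k in range(WEEKS):
        off = model.NewBoolVar(f"w{w}_weekend{k}_off")
        model.Add(works_day[w][7 * k + 5] + works_day[w][7 * k + 6] == 0
                  ).OnlyEnforceIf(off)
        free_weekends.append(off)
    model.AddBoolOr(free_weekends)

solver = cp_model.CpSolver()
status = solver.Solve(model)
if status in (cp_model.OPTIMAL, cp_model.FEASIBLE):
    symbols = {(0, 0): ".", (1, 0): "M", (0, 1): "N", (1, 1): "D"}
    for w in range(WORKERS):
        row = "".join(symbols[(solver.Value(works[w][d][0]),
                               solver.Value(works[w][d][1]))]
                      for d in range(DAYS))
        print(f"worker {w}: {row}")
else:
    print("no feasible schedule under these constraints")
```

The point of writing it this way is that the constraints are stated declaratively and the solver does the search, so getting a wrong answer usually means a constraint was encoded wrong, not that the search failed.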
I don't understand what you're saying - the idea is that we're asking the LLM to generate code to perform the search, rather than run an arbitrarily long search on its own, right? So why should the number of layers it has matter?