But Zork wouldn't be a very accurate measure of skill because GPT definitely knows Zork. Unfortunately the emulator (https://github.com/DLehenbauer/jszm) doesn't work with most games newer than Zork. I haven't revisited the code with newer GPT models either.
I was surprised how high your costs were. I assume you are putting the entire transcript into each prompt, but even then that seems high. Is GPT's planning also taking up a lot of room?
I did find giving GPT some hints about the known commands helped a lot, and I put in some detection of error messages and kept a running log of commands that wouldn't work. Getting it to navigate the parser is kind of half of the skill of playing one of these games. It would be interesting to have it play for a while, then step back and have it reflect on and enumerate how the gameplay itself works.
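For what it's worth, the error detection was nothing fancy; roughly this shape (the error strings and names here are illustrative, not my actual code):

    # Sketch of the failed-command log. The parser rejection phrases
    # below are typical Infocom/Inform-style messages, used as examples.
    PARSER_ERRORS = [
        "i don't know the word",
        "you can't see any such thing",
        "that sentence isn't one i recognize",
    ]

    failed_commands = []

    def record_if_failed(command, game_output):
        # If the interpreter's reply looks like a parser error,
        # remember the command so it won't be suggested again.
        if any(err in game_output.lower() for err in PARSER_ERRORS):
            failed_commands.append(command)

    def failures_hint():
        # Appended to every prompt so GPT stops retrying known failures.
        if not failed_commands:
            return ""
        return "Commands the parser rejected:\n" + "\n".join(failed_commands)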
Costs have dropped significantly in the months since I created the cost image.
Now I use GPT-4 Turbo. This model understands how text adventures work, so there is no need to give it a list of known commands.
Of course, you can try even more sophisticated techniques than mine. I tried the ReAct pattern and virtual discussions. So far, it always stumbles at the same place, at a critical understanding of the text, and I tried exactly this critical step dozens of times.
You will understand the issue once you play the game yourself. It only takes 20 minutes and is very easy:
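To make "ReAct pattern" concrete, the loop I tried looked roughly like this (a sketch from memory; the prompt wording and model name are placeholders, not my exact setup):

    from openai import OpenAI

    client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

    SYSTEM = (
        "You are playing a text adventure. Each turn, first write "
        "'Thought: ...' with your reasoning, then 'Action: ...' with "
        "exactly one game command."
    )

    history = []

    def react_turn(observation):
        # Feed the new game output in, get Thought + Action back,
        # and extract the command after "Action:".
        history.append({"role": "user", "content": f"Observation: {observation}"})
        reply = client.chat.completions.create(
            model="gpt-4-turbo-preview",
            messages=[{"role": "system", "content": SYSTEM}] + history,
        ).choices[0].message.content
        history.append({"role": "assistant", "content": reply})
        # Assumes the model followed the Thought/Action format.
        return reply.split("Action:", 1)[1].strip()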
You mean at the very end of the game? The game seems like it's only designed to trick you into that very ending :) Are you hoping it will figure out the game based on the context clues? I'm not sure I can find them myself...
A long time ago I did some exercises in "classical planning algorithms", which all feel very much like the early part of this game. I.e., how do you get ready to leave if you have to shower, and can't do that with clothes on, etc. A similar planning example involved changing a tire (opening the trunk, removing lug nuts, etc). It was surprisingly difficult to make an algorithm that could figure it out! You could search the state space given the transitions, but it exploded with what were effectively lots of dead ends; obvious to me as a human, but not to the algorithm. Which is to say that this is a harder problem than it might seem.
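A toy version of that search, just to show the shape of it (the facts and actions are made up for the shower example, not from any real planner):

    from collections import deque

    # State is a frozenset of facts; each action lists facts that must
    # hold, facts that must NOT hold, facts it adds, and facts it removes.
    ACTIONS = {
        "undress": ({"dressed"}, set(), set(), {"dressed"}),
        "shower":  (set(), {"dressed"}, {"clean"}, set()),
        "dress":   (set(), {"dressed"}, {"dressed"}, set()),
        "leave":   ({"clean", "dressed"}, set(), {"outside"}, set()),
    }

    def plan(start, goal):
        # Breadth-first search over the state space. Fine on toys like
        # this, but the frontier explodes as facts and actions are added.
        frontier = deque([(frozenset(start), [])])
        seen = {frozenset(start)}
        while frontier:
            state, steps = frontier.popleft()
            if goal <= state:
                return steps
            for name, (pos, neg, add, rem) in ACTIONS.items():
                if pos <= state and not (neg & state):
                    nxt = frozenset((state - rem) | add)
                    if nxt not in seen:
                        seen.add(nxt)
                        frontier.append((nxt, steps + [name]))
        return None

    print(plan({"dressed"}, {"outside"}))
    # -> ['undress', 'shower', 'dress', 'leave']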
Yes, that is the first "bad" ending. After that, just follow the one relevant context clue and look under the bed. That might already be enough.
I chose this game because it tells you, at every step, what you have to do next. There is not much to try out; only the narrative changes. One time you have to go to work, and one time you have to flee.
Other text adventures are even more problematic. I saw GPT-4 try for dozens of steps in the "Hitchhiker's Guide to the Galaxy" adventure just to turn on the lights. And that is just the first command you have to get right in the game.