A field can seem to be going quickly and going nowhere at the same time. Or rather, a new technique can be invented and then exhausted in the time it takes somebody to get a PhD. (See https://en.wikipedia.org/wiki/Renormalization_group applied to phase transitions, which turned up just in time for the physics job crisis of 1970.)
I never believed there was going to be a GPT-5 trained with exponentially more text and resources. Not only is there not enough text, but that's the path to ruin. Why?
Cycle time. Two years ago we had little idea of how these models work, so I knew there was huge room for improving performance. That gets the cost down, lets you put the models on your device, and speeds up development. If I can train 10 models in the time it takes you to train 1 model, I can make much faster progress.
However, even a GPT-15 trained with a Dyson sphere is going to struggle to sort things. (Structurally, a pure LLM can't do that!) My #1 beef with Microsoft's Copilot is this: ask it if it can sort a certain list of items (either a list you are discussing with it, or, say, "states of the United States ordered by percent water area") and it will say yes; ask it what it thinks the probability is that it will get the order right and it will say "very high"; but when you actually try it, the list comes out totally wrong.
It is equally unable to "help me make an atom bomb," except that in the bomb case it will say it can't, while in the sorting case it says it can.
The obvious answer is that it should use tools to sort. That's right, but the problem of "knowing what you can really do with your tools" is philosophically fraught. (The problems here are so intractable they lead people like Roger Penrose to conclude "I couldn't do math if I wasn't a thetan.")
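To make the tool angle concrete, here is a minimal sketch of the division of labor I mean, in Python, with made-up data and names of my own choosing (not any particular vendor's API): the model supplies the facts as structured output, and ordinary deterministic code does the ordering, which is the part a pure LLM is structurally bad at.

    # Minimal sketch: the LLM supplies facts, deterministic code does the sorting.
    # The state/percentage pairs below are made up purely for illustration.

    # Pretend this came back from the model as structured output
    # (e.g. JSON it was asked to emit) rather than a pre-sorted list.
    llm_facts = [
        ("State A", 41.5),
        ("State B", 3.2),
        ("State C", 25.8),
        ("State D", 0.7),
    ]

    # The "tool" is just an ordinary sort, which cannot scramble the order.
    ranked = sorted(llm_facts, key=lambda pair: pair[1], reverse=True)

    for state, pct_water in ranked:
        print(f"{state}: {pct_water}% water area")

The hard part, of course, is exactly the one above: getting the model to know it should hand the job to code like this rather than confidently free-styling the order itself.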
I'm deliberately blurring refusal with having an accurate picture of its own abilities and, past that, having an accurate picture of what it can do given tools. Both are tested by
"Can you X?"
With refusal you find out just how shallow it is, because it really will answer all sorts of questions that are "helpful" in making a nuclear bomb, but when you ask it directly it shuts up. In another sense nothing it does is "helpful," because it's not going to hunt down some people in Central Asia who have 50 kg of U-235 burning a hole in their pocket for you, which is what would actually "help."
I use tool-using LLMs frequently, but I find they often need help using their tools. It is a lot of fun to talk to Windsurf about the struggles it has with its tools, and it feels strangely satisfying to help it out.