These results are fantastic. Claude 3.5 and o1 are already good enough to provide value, so I can't wait to see how o3 performs comparatively in real-world scenarios.
But I gotta say, we must be saturating just about any zero-shot reasoning benchmark imaginable at this point. And we will still argue about whether this is AGI, in my opinion because these LLMs are forgetful and it's very difficult for an application developer to fix that.
Models will need better ways to remember and learn from doing a task over and over. For example, let's look at code agents: the best we can do, even with o3, is to cram as much of the code base as we can fit into a context window. And if it doesn't fit we branch out to multiple models to prune the context window until it does fit. And here's the kicker – the second time you ask for it to do something this all starts over from zero again. With this amount of reasoning power, I'm hoping session-based learning becomes the next frontier for LLM capabilities.
(There are already things like tool use, linear attention, RAG, etc that can help here but currently they come with downsides and I would consider them insufficient.)
But I gotta say, we must be saturating just about any zero-shot reasoning benchmark imaginable at this point. And we will still argue about whether this is AGI, in my opinion because these LLMs are forgetful and it's very difficult for an application developer to fix that.
Models will need better ways to remember and learn from doing a task over and over. For example, let's look at code agents: the best we can do, even with o3, is to cram as much of the code base as we can fit into a context window. And if it doesn't fit we branch out to multiple models to prune the context window until it does fit. And here's the kicker – the second time you ask for it to do something this all starts over from zero again. With this amount of reasoning power, I'm hoping session-based learning becomes the next frontier for LLM capabilities.
(There are already things like tool use, linear attention, RAG, etc that can help here but currently they come with downsides and I would consider them insufficient.)