Let me go against some skeptics and explain why I think full o3 is pretty much AGI or at least embodies most essential aspects of AGI.
What has been lacking so far in frontier LLMs is the ability to reliably deal with the right level of abstraction for a given problem. Reasoning is useful but often falls short if one cannot reason at the right level of abstraction. (Note that many humans can't either when they deal with unfamiliar domains, whereas these models' failures are not confined to unfamiliar domains.)
ARC has been challenging precisely because solving its problems often requires:
1) using multiple different *kinds* of core knowledge [1], such as symmetry, counting, color, AND
2) using the right level(s) of abstraction
Achieving human-level performance in the ARC benchmark, as well as top human performance in GPQA, Codeforces, AIME, and Frontier Math suggests the model can potentially solve any problem at the human level if it possesses essential knowledge about it. Yes, this includes out-of-distribution problems that most humans can solve.
It might not yet be able to generate highly novel theories, frameworks, or artifacts to the degree that Einstein, Grothendieck, or van Gogh could. But not many humans can either.
[1] https://www.harvardlds.org/wp-content/uploads/2017/01/Spelke...

ADDED: Thanks to lswainemoore below for the link to Chollet's posts. I've analyzed some easy problems that o3 failed at. They involve spatial intelligence, including connection and movement. This skill is very hard to learn from textual and still image data.

I believe this sort of core knowledge is learnable through movement and interaction data in a simulated world, and it will not present a very difficult barrier to cross. (OpenAI purchased a company behind a Minecraft clone a while ago. I've wondered if this is the purpose.)
Quote from the creators of the ARC-AGI benchmark: "Passing ARC-AGI does not equate to achieving AGI, and, as a matter of fact, I don't think o3 is AGI yet. o3 still fails on some very easy tasks, indicating fundamental differences with human intelligence."
I like the notion, implied in the article, that AGI won't be verified by any single benchmark, but by our collective inability to come up with benchmarks that defeat some eventual AI system. This matches the cat-and-mouse game we've been seeing for a while, where benchmarks have to constantly adapt to better models.
I guess you can say the same thing for the Turing Test. Simple chat bots beat it ages ago in specific settings, but the bar is much higher now that the average person is familiar with their limitations.
If/once we have an AGI, it will probably take weeks to months to really convince ourselves that it is one.
I'd need to see what kinds of easy tasks those are and would be happy to revise my hypothesis if that's warranted.
Also, it depends a great deal on what we define as AGI and whether it needs to be a strict superset of typical human intelligence. o3's intelligence is probably superhuman in some aspects but inferior in others. We can find many humans who exhibit such tendencies as well. We'd probably say they think differently but would still call them generally intelligent.
Personally, I think it's fair to call them "very easy". If a person I otherwise thought was intelligent was unable to solve these, I'd be quite surprised.
Thanks! I've analyzed some easy problems that o3 failed at. They involve spatial intelligence including connection and movement. This skill is very hard to learn from textual and still image data.
I believe this sort of core knowledge is learnable through movement and interaction data in a simulated world and it will not present a very difficult barrier to cross.
(OpenAI purchased a company behind a Minecraft clone a while ago. I've wondered if this is the purpose.)
> I believe this sort of core knowledge is learnable through movement and interaction data in a simulated world and it will not present a very difficult barrier to cross.
Maybe! I suppose time will tell. That said, spatial intelligence (connection/movement included) is the whole game in this evaluation set. I think it's revealing that they can't handle these particular examples, and problematic for claims of AGI.
> This skill is very hard to learn from textual and still image data.
I had the same take at first, but thinking about it again, I'm not quite sure?
Take the "blue dots make a cross" example (the second one). The inputs only has four blue dots, which makes it very easy to see a pattern even in text data: two of them have the same x coordinate, two of them have the same y (or the same first-tuple-element and second-tuple-element if you want to taboo any spatial concepts).
Then if you look into the output, you can notice that all the input coordinates are also in the output set, just not always with the same color. If you separate them into "input-and-output" and "output-only", you quickly notice that all of the output-only squares are blue and share a coordinate (tuple-element) with the blue inputs. If you split the "input-and-output" set into "same color" and "color changed", you can notice that the changes only go from red to blue, and that the coordinates that changed are clustered, and at least one element of the cluster shares a coordinate with a blue input.
Of course, it's easy to build this chain of reasoning in retrospect, but it doesn't seem like a complete stretch: each step only requires noticing patterns in the data, and it's how a reasonably puzzle-savvy person might solve this if you didn't let them draw the squares on papers. There are a lot of escape games with chains of reasoning much more complex and random office workers solve them all the time.
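To make the first step concrete, here is a minimal sketch of that kind of purely textual pattern-finding; the coordinates below are made up for illustration, not taken from the actual ARC task:

    # Toy sketch: find "aligned" blue dots from coordinate tuples alone.
    # The cell list is invented for illustration; a real task would be parsed from the grid.
    from collections import defaultdict

    cells = [(2, 5, "blue"), (7, 5, "blue"), (4, 1, "blue"), (4, 8, "blue"),
             (3, 5, "red"), (4, 6, "red")]

    blues = [(x, y) for x, y, c in cells if c == "blue"]

    by_x, by_y = defaultdict(list), defaultdict(list)
    for x, y in blues:
        by_x[x].append((x, y))
        by_y[y].append((x, y))

    # Blue dots sharing a first or second tuple element -- candidate "cross" arms.
    shared_x = {x: pts for x, pts in by_x.items() if len(pts) > 1}
    shared_y = {y: pts for y, pts in by_y.items() if len(pts) > 1}
    print(shared_x)   # {4: [(4, 1), (4, 8)]}
    print(shared_y)   # {5: [(2, 5), (7, 5)]}

Nothing here requires a 2D representation; it's just grouping tuples by their elements.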
The visual aspect makes the patterns jump out at us more, but the fact that o3 couldn't find them at all with thousands of dollars of compute budget still seems meaningful to me.
EDIT: Actually, looking at Twitter discussions[1], o3 did find those patterns, but was stumped by ambiguity in the test input that the examples didn't cover. Its failures on the "cascading rectangles" example[2] look much more interesting.
You've never met a doctor who couldn't figure out how to work their email? Or who lacked street smarts? You can have a PhD but be unable to reliably handle soft skills, or any number of things you might 'expect' someone to be able to do.
Just playing devil's advocate or nitpicking the language a bit...
An important distinction here is you’re comparing skill across very different tasks.
I’m not even going that far, I’m talking about performance on similar tasks. Something many people have noticed about modern AI is it can go from genius to baby-level performance seemingly at random.
Take self driving cars for example, a reasonably intelligent human of sound mind and body would never accidentally mistake a concrete pillar for a road. Yet that happens with self-driving cars, and seemingly here with ARC-AGI problems which all have a similar flavor.
A coworker of mine has a PhD in physics. I showed him the difference between little and big endian in a hex editor, and showed him the file sizes of raw image files and how to compute them... I explained it three times, and maybe he has understood part of it by now.
Doctors[1] or, say, pilots practice skilled professions that are difficult to master and deserve respect, yes, but they do not need high levels of intelligence to be good at their jobs. They require many other abilities that are hard, like making decisions under pressure or fine motor skills, but not necessarily intelligence.
Also, not knowing something is hardly a criterion; skilled humans focus on their areas of interest above most other knowledge and can be unaware of other subjects.
Fields Medal winners, for example, may not be aware of most pop culture; that doesn't make them unable to follow it, just not interested.
—-
[1] Most doctors, including surgeons and many respected specialists. Some doctors do need those skills, but they are a specialized few, and they generally do know how to use email.
A PhD learnt their field. If they learnt that field, reasoning through everything to understand the material, then - given enough time - they are capable of learning email and street smarts.
Which is why a reasoning LLM should be able to do all of those things.
They say it isn't AGI, but I think the way o3 functions can be refined into AGI - it's learning to solve new, novel problems. We just need to make it do that more consistently, which seems achievable.
Have we really watered down the definition of AGI that much?
LLMs aren't really capable of "learning" anything outside their training data. Which I feel is a very basic and fundamental capability of humans.
Every new request thread is a blank slate utilizing whatever context you provide for the specific task, and after the thread is done (or the context limit runs out) it's like it never happened. Sure, you can use databases, do web queries, etc., but these are inflexible band-aid solutions, far from what's needed for AGI.
> LLMs aren't really capable of "learning" anything outside their training data.
ChatGPT has had for some time the feature of storing memories about its conversations with users. And you can use function calling to make this more generic.
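For instance, here is a rough sketch of what model-driven memory via the generic tools/function-calling interface could look like; the save_memory tool and the JSON file store are my own hypothetical choices, not OpenAI's built-in memory feature:

    # Hypothetical sketch: let the model persist facts across conversations via a tool call.
    import json

    MEMORY_FILE = "memories.json"

    save_memory_tool = {
        "type": "function",
        "function": {
            "name": "save_memory",
            "description": "Persist a fact about the user for future conversations.",
            "parameters": {
                "type": "object",
                "properties": {"fact": {"type": "string"}},
                "required": ["fact"],
            },
        },
    }

    def save_memory(fact: str) -> None:
        # Append the fact to a local JSON file; a real system might use a database.
        try:
            with open(MEMORY_FILE) as f:
                memories = json.load(f)
        except FileNotFoundError:
            memories = []
        memories.append(fact)
        with open(MEMORY_FILE, "w") as f:
            json.dump(memories, f)

    # Pass save_memory_tool in the request's tool list, run save_memory(...) whenever the
    # model emits a matching tool call, and prepend the stored facts to future prompts.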
I think drawing the boundary at “model + scaffolding” is more interesting.
That's the whole point of LlamaIndex? I can connect my LLM to any node or context I want. Sync it to a real-time data flow like an API and it can learn...? How is that different from a human?
Once Optimus is up and working by the 100k+, the spatial problems will be solved. We just don't have enough spatial awareness data, or a way for the LLM to learn about the physical world.
That's true for vanilla LLMs, but also keep in mind that there are no details about o3's architecture at the moment. Clearly they are doing something different given the huge performance jump on a lot of benchmarks, and it may well involve in-context learning.
My point was to caution against being too confident about the underlying architecture, not to argue for any particular alternative.
Your statement is false - things changed a lot between GPT-4 and o1 under the hood, but notably not by making the model larger. In fact, the model size of o1 is smaller than GPT-4 by several orders of magnitude! Improvements are being made in other ways.
What's your explanation for why it can only get ~70% on SWE-bench Verified?
I believe about 90% of the tasks were estimated by humans to take less than one hour to solve, so we aren't talking about very complex problems, and to boot, the contamination factor is huge: o3 (or any big model) will have in-depth knowledge of the internals of these projects, and often even know about the individual issues themselves (e.g., you can ask what GitHub issue #4145 in project foo was, and there's a decent chance it can tell you exactly what the issue was about!).
I've spent tons of time evaluating o1-preview on SWEBench-Verified.
For one, I speculate OpenAI is using a very basic agent harness to get the results they've published on SWEBench. I believe there is a fair amount of headroom to improve results above what they published, using the same models.
For two, some of the instances, even in SWEBench-Verified, require a bit of "going above and beyond" to get right. One example is an instance where the user states that a TypeError isn't properly handled. The developer who fixed it handled the TypeError but also handled a ValueError, and the golden test checks for both. I don't know how many instances fall into this category, but I suspect it's more than on a simpler benchmark like MATH.
One possibility is that it may not yet have sufficient experience and real-world feedback for resolving coding issues in professional repos, as this involves multiple steps and very diverse actions (or branching factor, in AI terms). They have committed to not training on API usage, which limits their ability to directly acquire training data from it. However, their upcoming agentic efforts may address this gap in training data.
Right, but the branching factor increases exponentially with the scope of the work.
I think it's obvious that they've cracked the formula for solving well-defined, small-in-scope problems at a superhuman level. That's an amazing thing.
To me, it's less obvious that this implies that they will in short order with just more training data be able to solve ambiguous, large-in-scope problems at even just a skilled human level.
There are far more paths to consider, much more context to use, and in an RL setting, the rewards are much more ambiguously defined.
Their reasoning models can learn from procedures and methods, which generalize far better than data. Software tasks are diverse but most tasks are still fairly limited in scope. Novel tasks might remain challenging for these models, as they do for humans.
That said, o3 might still lack some kind of interaction intelligence that’s hard to learn. We’ll see.
GPQA scores mostly come from pre-training, against content in the corpus. They have gone silent, but look at the GPT-4 technical report, which calls this out.
We are nowhere close to what Sam Altman calls AGI and transformers are still limited to what uniform-TC0 can do.
As an example, the Boolean Formula Value Problem is NC1-complete, thus beyond transformers but trivial to solve with a TM.
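To make "trivial" concrete, here is a toy linear-time evaluator; the fully parenthesized input format is my own choice, not a standard encoding:

    # Toy evaluator for the Boolean Formula Value Problem.
    # Input format (invented for this sketch): fully parenthesized, e.g. "((1&0)|~0)".
    def evaluate(formula: str) -> bool:
        pos = 0

        def parse() -> bool:
            nonlocal pos
            ch = formula[pos]
            if ch in "01":                # constant
                pos += 1
                return ch == "1"
            if ch == "~":                 # negation
                pos += 1
                return not parse()
            pos += 1                      # consume '(' of '(' formula op formula ')'
            left = parse()
            op = formula[pos]
            pos += 1
            right = parse()
            pos += 1                      # consume ')'
            return (left and right) if op == "&" else (left or right)

        return parse()

    print(evaluate("((1&0)|~0)"))         # True

Each character is consumed once, so a sequential machine evaluates the formula in linear time.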
As it is now proven that the frame problem is equivalent to the halting problem, even if we can move past uniform-TC0 limits, novelty is still a problem.
I think the advancements are truly extraordinary, but unless you set the bar very low, we aren't close to AGI.
Neither TC0 nor uniform-TC0 is physically realizable; they are tools, not physical devices.
The default nonuniform circuit classes are allowed to have a different circuit per input size; the uniform types have unbounded fan-in.
Similar to how a k-tape TM doesn't get 'charged' for the input size.
With Nick's Class (NC), the number of components is similar to traditional compute time, while depth relates to the ability to parallelize operations.
These are different than biological neurons, not better or worse but just different.
Human neurons can use dendritic compartmentalization, use spike timing, retime spikes, etc.
The perceptron model we use in ML is useful, but it is not able to do XOR in one layer, while biological neurons do that without anything even reaching the soma, purely in the dendrites.
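As a quick illustration of the single-layer limitation, a brute-force check over a small weight grid (which demonstrates, rather than proves, the classic non-separability result):

    # Toy check: a single threshold unit (perceptron) cannot compute XOR,
    # while a two-layer network of the same units can.
    import itertools

    XOR = {(0, 0): 0, (0, 1): 1, (1, 0): 1, (1, 1): 0}

    def unit(w1, w2, b, x1, x2):
        # One threshold (perceptron) unit: fires iff the weighted sum exceeds 0.
        return int(w1 * x1 + w2 * x2 + b > 0)

    grid = [v / 2 for v in range(-8, 9)]          # weights/bias in -4.0 .. 4.0
    single_layer_ok = any(
        all(unit(w1, w2, b, x1, x2) == y for (x1, x2), y in XOR.items())
        for w1, w2, b in itertools.product(grid, repeat=3)
    )
    print("one unit computes XOR:", single_layer_ok)                    # False

    def two_layer(x1, x2):
        # XOR(x1, x2) = (x1 OR x2) AND NOT (x1 AND x2), built from the same units.
        h_or = unit(1, 1, -0.5, x1, x2)
        h_and = unit(1, 1, -1.5, x1, x2)
        return unit(1, -1, -0.5, h_or, h_and)

    print(all(two_layer(x1, x2) == y for (x1, x2), y in XOR.items()))   # True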
Statistical learning models still come down to a choice function, no matter whether you call that set shattering or...
With physical computers the time hierarchy theorem does apply: if g(n) grows sufficiently faster than f(n) (roughly, f(n) log f(n) = o(g(n)) for Turing machines), then TIME(g(n)) can solve strictly more problems than TIME(f(n)).
So you can simulate a NTM with exhaustive search with a physical computer.
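As a sketch of that exhaustive search, here is deterministic simulation of nondeterminism by tracking every branch; for brevity it uses a nondeterministic finite automaton as a stand-in rather than a full NTM, but the idea is the same:

    # Sketch: simulating nondeterminism by exhaustively tracking every branch.
    # An NTM simulation works the same way, with sets of configurations instead of states.
    def accepts(start, accept_states, transitions, word):
        # transitions maps (state, symbol) -> set of possible next states
        frontier = {start}
        for symbol in word:
            frontier = {nxt
                        for state in frontier
                        for nxt in transitions.get((state, symbol), set())}
        return bool(frontier & accept_states)

    # Example machine: nondeterministically guess that the 3rd symbol from the end is 'a'.
    trans = {("q0", "a"): {"q0", "q1"}, ("q0", "b"): {"q0"},
             ("q1", "a"): {"q2"}, ("q1", "b"): {"q2"},
             ("q2", "a"): {"q3"}, ("q2", "b"): {"q3"}}
    print(accepts("q0", {"q3"}, trans, "babbabb"))   # True: 3rd symbol from the end is 'a'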
Physical computers also tend to have NAND and XOR gates, and can have different circuit depths.
When you are in TC0, you only have AND, OR and Threshold (or majority) gates.
Think of instruction-level parallelism in a typical CPU, which can return early, vs. Itanium EPIC, which had to wait for the longest operation. Predicated execution is also how GPUs work.
They can send a mask and save on load/store ops, as an example, but the cost of that parallelism is the constant depth.
It is the parallelism tradeoff that both makes transformers practical and limits what they can do.
The IID assumption and autograd requiring smooth manifolds play a role too.
The frame problem, which causes hard problems to become unsolvable for computers and people alike, does also.
But the fact that we have polynomial-time solutions for the Boolean Formula Value Problem, as mentioned in my post above, is probably a simpler way of realizing that physical computers aren't limited to uniform-TC0.
Do you just mean because any physically realizable computer is a finite state machine? Or...?
I wouldn't describe a computer's usual behavior as having constant depth.
It is fairly typical to talk about problems in P as being feasible (though when the constant factors are too big, this isn't strictly true of course).
Just because my computer can't run a particular program and produce the correct answer for unreasonably large inputs, due to running out of memory, we don't generally say that my computer is fundamentally incapable of executing that algorithm.
> Achieving human-level performance in the ARC benchmark, as well as top human performance in GPQA, Codeforces, AIME, and Frontier Math strongly suggests the model can potentially solve any problem at the human level if it possesses essential knowledge about it.
The article notes, "o3 still fails on some very easy tasks". What explains these failures if o3 can solve "any problem" at the human level? Do these failed cases require some essential knowledge that has eluded the massive OpenAI training set?
Great point. I'd love to see what these easy tasks are and would be happy to revise my hypothesis accordingly. o3's intelligence is unlikely to be a strict superset of human intelligence. It is certainly superior to humans in some respects and probably inferior in others. Whether it's sufficiently generally intelligent would be both a matter of definition and empirical fact.
Please stop calling it AGI; we don't even universally know or agree on what that should actually mean. How far did we get with the hype of calling a lossy probabilistic compressor that slowly fires words at us AGI? That's a real bummer to me.
Personally I find "human-level" to be a borderline meaningless and limiting term. Are we now super human as a species relative to ourselves just five years ago because of our advances in developing computer programs that better imitate what many (but far from all) of us were already capable of doing? Have we reached a limit to human potential that can only be surpassed by digital machines? Who decides what human level is and when we have surpassed it? I have seen some ridiculous claims about ai in art that don't stand up to even the slightest scrutiny by domain experts but that easily fool the masses.
No, I think we're just tired and depressed as a species... Existing systems work to a degree but aren't living up to their potential of increasing happiness in line with our technological capabilities.
The problem with ARC is that there are a finite number of heuristics that could be enumerated and trained for, which would give the model a substantial leg up on this evaluation but would not generalize to other domains.
For example, if they produce millions of examples of the type of problems o3 still struggles on, it would probably do better at similar questions.
Perhaps the private data set is different enough that this isn't a problem, but the ideal situation would be to unveil a truly novel dataset, which it seems like ARC aims to do.
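To illustrate what that kind of targeted data generation could look like, here is a toy sketch with one made-up transformation rule; a real effort would enumerate many rules, including the spatial ones o3 failed on:

    # Toy sketch of programmatic task generation for one hand-written rule
    # ("mirror the grid left-to-right"); the rule and grid sizes are invented for illustration.
    import random

    def make_example(height=5, width=5, colors=4):
        grid = [[random.randrange(colors) for _ in range(width)] for _ in range(height)]
        mirrored = [list(reversed(row)) for row in grid]
        return {"input": grid, "output": mirrored}

    dataset = [make_example() for _ in range(1000)]   # scale this up as far as you like
    print(dataset[0]["input"][0], dataset[0]["output"][0])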
I mean, fkcu me when they have those things. However, maybe they are just lazy and their judgement is fine, for a lazy intelligence. Its inner self thinks "why are these bastards asking me to do this?". I doubt that is actually happening, but now... prove it isn't.