This means we have an algorithm to get to human level performance on this task.
If you think this task is an eval of general reasoning ability, we have an algorithm for that now.
There's a lot of work ahead to generalize o3 performance to all domains. I think this explains why many researchers feel AGI is within reach, now that we have an algorithm that works.
Congrats to both Francois Chollet for developing this compelling eval, and to the researchers who saturated it!
As excited as I am by this, I still feel like this is just a small approximation of a small chunk of human reasoning ability at large. o3 (and whatever comes next) feels to me like it will head down the path of being a reasoning coprocessor for various tasks.
My personal five cents is that reasoning will be there when an LLM gives you some kind of outcome and, when questioned about it, can explain every bit of the result it produced.
For example, if we asked an LLM to produce an image of a "human woman photorealistic", it produces a result. After that, you should be able to ask it "tell me about the background" and it should be able to explain: "Since the user didn't specify a background in the query, I randomly decided to draw her standing in front of a fantasy backdrop of iconic Amsterdam houses. Amsterdam houses are usually 3 stories tall, attached to each other, and 10 meters wide. They usually have cranes at the top floor, which help bring goods up, since the doors are too narrow for any object wider than 1 m. The woman stands approximately 25 meters in front of the houses. She is 1.59 m tall, which gives us the correct perspective. It is 11:16 am on August 22nd, which I used to calculate the correct position of the sun and align all shadows with the projected lighting conditions. The color of her skin is set at RGB:xxxxxx, chosen randomly", etc.
And it is not too much to ask of LLMs. They have access to all the information above, having read the whole internet, so there is definitely a description of Amsterdam architecture, of what a human body looks like, and of how to correctly estimate the time of day from shadows (and vice versa). The only thing missing is the logic that connects all this information and applies it correctly to generate the final image.
I like to think about LLMs as fancy, genius compression engines. They took all the information on the internet, compressed it, and are able to cleverly query this information for the end user. That is tremendously valuable, but whether intelligence emerges out of it, I'm not sure. Digital information doesn't necessarily contain everything needed to understand how it was generated and why.
I see two approaches for explaining the outcome:
1. Reasoning backward from the result and justifying it.
2. Explainability: justifying the outcome by looking at which neurons were activated.
The first could lead to lying. E.g. think of a high schooler explaining copied homework.
The second one does access the paths that influenced the decision, but it is a hard task due to the inherent way neural networks work.
No, I'm not confusing them. I realize that LLMs are sometimes paired with diffusion models to produce images. I'm talking about language models actually describing the pixel data of the image.
I think it's hard to enumerate the unknown, but I'd personally love to see how models like this perform on things like word problems where you introduce red herrings. Right now, LLMs at large tend to struggle mightily to understand when some of the given information is not only irrelevant, but may explicitly serve to distract from the real problem.
That’s not inability to reason though, that’s having a social context.
Humans also don't tend to operate in a rigorously logical mode by default; they understand that math word problems are an exception where the language may be adversarial because they're trained for that special context in school. If you give the LLM that social context, e.g. that the language may be deceptive, its "mistakes" disappear.
What you're actually measuring is that the LLM defaults to assuming you misspoke while trying to include relevant information, rather than assuming you were trying to trick it, which is exactly the social context you'd expect from a model trained on general chat interactions.
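This is testable, by the way. A toy sketch: ask the same word problem with and without a distractor sentence, and optionally prepend the social-context warning. `ask_llm` is a hypothetical stand-in for whatever chat API you use:

    # Toy red-herring probe. `ask_llm` is a hypothetical stand-in.
    def ask_llm(prompt: str) -> str:
        raise NotImplementedError("wire up your LLM call here")

    WARNING = "Note: some information in this problem may be irrelevant."

    PROBLEMS = [{
        "core": "A farmer has 12 sheep and buys 5 more. How many does he have?",
        "distractor": "Three of the sheep are black.",
        "answer": "17",
    }]

    def robust(p: dict, warn: bool = False) -> bool:
        prefix = WARNING + " " if warn else ""
        plain = ask_llm(prefix + p["core"])
        noisy = ask_llm(prefix + p["distractor"] + " " + p["core"])
        # A robust reasoner answers the same either way.
        return p["answer"] in plain and p["answer"] in noisy

Comparing `robust(p)` against `robust(p, warn=True)` separates "can't reason" from "assumed the wrong social context".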
LLMs are still bound to a prompting session. They can't form long-term memories, can't ponder them, and can't develop experience. They have no cognitive architecture.
'Agents' (i.e. workflows intermingling code and calls to LLMs) are still a thing (as shown by the fact that there is a post by Anthropic on this subject on the front page right now), and they are very hard to build.
One consequence of that, for instance: it's not possible to have an LLM exhaustively explore a topic.
LLMs don't, but who said AGI should come from LLMs alone? When I ask ChatGPT about something "we" worked on months ago, it "remembers" and can continue the conversation with that history in mind.
I'd say humans are also bound to prompting sessions in that way.
The last time I used ChatGPT's 'memory' feature, it got full very quickly. It remembered my name, my dog's name, and a couple of tobacco casing recipes it came up with. OpenAI doesn't seem to be using embeddings and a vector database, just text snippets it injects into every conversation. Because RAG is too brittle? The same problem arises when composing LLM calls: efficient and robust workflows are those whose prompts and/or DAG were obtained via optimization techniques. Hence DSPy.
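Concretely, the contrast between the two designs (a minimal sketch, assuming a hypothetical `embed(text)` call into some embedding model):

    import numpy as np

    def embed(text: str) -> np.ndarray:
        raise NotImplementedError("call your embedding model here")

    # (a) What OpenAI appears to do: a small fixed set of snippets
    # injected into every prompt -- fills up fast.
    SNIPPETS = ["User's name is Alice.", "User's dog is named Rex."]

    def naive_prompt(msg: str) -> str:
        return "\n".join(SNIPPETS) + "\n" + msg

    # (b) Embeddings + vector store: keep many memories, inject only
    # the top-k most similar to the current message.
    class VectorMemory:
        def __init__(self):
            self.texts, self.vecs = [], []

        def add(self, text: str):
            self.texts.append(text)
            self.vecs.append(embed(text))

        def retrieve(self, query: str, k: int = 3) -> list[str]:
            q = embed(query)
            sims = [v @ q / (np.linalg.norm(v) * np.linalg.norm(q) + 1e-9)
                    for v in self.vecs]
            return [self.texts[i] for i in np.argsort(sims)[-k:]]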
Consider the following use case: keeping a swimming pool's water clean. I can have a long-running conversation with an LLM to guide me in getting it right. However, I can't have an LLM handle the problem autonomously. I'd like to have it notify me on its own: "hey, it's been 2 days, any improvement? Do you mind sharing a few pictures of the pool as well as the pH/chlorine test results?" Nothing mind-bogglingly complex. Nothing that couldn't be achieved using current LLMs. But still something I'd have to implement myself, and which turns out to be more complex to achieve than expected. This is the kind of improvement I'd like to see big AI companies going after, rather than research-grade ultra-smart AIs.
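To show how mundane the missing piece is, here's a sketch of roughly what I mean; `ask_llm` and `notify_user` are hypothetical stand-ins for a model API and a messaging channel:

    import time

    FOLLOW_UP = 2 * 24 * 3600  # "hey, it's been 2 days"

    def ask_llm(prompt: str) -> str: ...    # hypothetical model call
    def notify_user(msg: str) -> str: ...   # hypothetical channel; returns the reply

    def pool_agent(history: list[str]) -> None:
        while True:
            time.sleep(FOLLOW_UP)
            # The LLM drafts the check-in from the conversation so far.
            check_in = ask_llm(
                "You are helping keep a pool clean. Given this history, "
                "write a short follow-up asking for pictures and pH/chlorine "
                "readings:\n" + "\n".join(history))
            reply = notify_user(check_in)
            history += [check_in, reply]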
Kinda interesting: every single CS person (especially PhDs), when talking about reasoning, is unable to concisely quantify, enumerate, qualify, or define reasoning.
People with (high) intelligence talking about and building (artificial) intelligence, yet never able to convincingly explain aspects of intelligence; they just talk ambiguously and circularly around it.
What are we humans getting ourselves into, inventing Skynet :wink.
It's been an ongoing pet project of mine to tackle reasoning, but I can't answer your question with regard to LLMs.
>> Kinda interesting: every single CS person (especially PhDs), when talking about reasoning, is unable to concisely quantify, enumerate, qualify, or define reasoning.
Kinda interesting that mathematicians also can't do the same for mathematics.
Mathematicians absolutely can, it's called foundations, and people actively study what mathematics can be expressed in different foundations. Most mathematicians don't care about it though for the same reason most programmers don't care about Haskell.
I don't care about Haskell either, but we know what reasoning is [1]. It's been studied extensively in mathematics, computer science, psychology, cognitive science and AI, and in philosophy going back literally thousands of years with grandpapa Aristotle and his syllogisms. Formal reasoning, informal reasoning, non-monotonic reasoning, etc etc. Not only do we know what reasoning is, we know how to do it with computers just fine, too [2]. That's basically the first 50 years of AI, that folks like His Nobelist Eminence Geoffrey Hinton will tell you was all a Bad Idea and a total failure.
Still somehow the question keeps coming up- "what is reasoning". I'll be honest and say that I imagine it's mainly folks who skipped CS 101 because they were busy tweaking their neural nets who go around the web like Diogenes with his lantern, howling "Reasoning! I'm looking for a definition of Reasoning! What is Reasoning!".
I have never heard the people at the top echelons of AI and deep learning (LeCun, Schmidhuber, Bengio, Hinton, Ng, Hutter, etc.) say things like "what's reasoning?". The reason, I suppose, is that they know exactly what it is, because it was the one thing they could never do with their neural nets that classical AI could do between sips of coffee at breakfast [3]. Those guys know exactly what their systems are missing and, to their credit, have made no bones about it.
_________________
[1] e.g. see my profile for a quick summary.
[2] See all of Russell & Norvig, for instance.
[3] Schmidhuber's doctoral thesis was an implementation of genetic algorithms in Prolog, even.
I have a question for you, which I've asked many philosophy professors, though none could answer satisfactorily. Since you seem to have a penchant for reasoning, perhaps you might have a good answer. (I hope I remember the full extent of the question properly; I might hit you up with some follow-up questions.)
It pertains to the source of the inference power of deductive inference. Do you think all deductive reasoning originated inductively? Like, when someone discovers a rule or fact that seemingly has contextual predictive power, obviously that can be confirmed inductively by observations, but did that deductive reflex of the mind coagulate from inductive experiences? Maybe not all derived deductive rules, but the original deductive ones.
I'm sorry but I have no idea how to answer your question, which is indeed philosophical. You see, I'm not a philosopher, but a scientist. Science seeks to pose questions, and answer them; philosophy seeks to pose questions, and question them. Me, I like answers more than questions so I don't care about philosophy much.
Well, yeah, it's partially philosophical; I guess my haphazard use of language like "all" makes it more philosophical than intended.
But I'm getting at a few things.
One of those things is neurological: how do deductive inference constructs manifest in neurons, and is it really, inadvertently, an inductive process that creates deductive neural functions?
The other aspect of the question is, I guess, more philosophical: why does deductive inference work at all? I think clues to a potential answer can be seen in the mechanics of generalization, of antecedents predicting (or correlating with) certain generalized consequences consistently. The brain coagulates generalized coinciding concepts by reinforcement, and it recognizes or differentiates including or excluding instances of a generalization by recognition properties that seem to gatekeep identities accordingly. It's hard to explain succinctly what I mean by the latter, but I'm planning on writing an academic paper on it.
To clarify, what neural nets are missing is a capability present in classical, logic-based and symbolic systems. That's the ability that we commonly call "reasoning". No need to prove any negatives. We just point to what classical systems are doing and ask whether a deep net can do that.
Well, let's just say I think I can explain reasoning better than anyone I've encountered. I have my own hypothesized theory of what it is and how it manifests in neural networks.
I doubt your mathematician example is equivalent.
Examples fresh in my mind that further my point:
I've heard Yann LeCun baffled by LLMs' instantiation/emergence of reasoning, along with other AI researchers. Eric Schmidt thinks agentic reasoning is the current frontier and people should be focusing on that. I was listening to the start of an AI/machine-learning interview a week ago where a CS PhD, asked to explain reasoning, could muster at best "you know it when you see it"... not to mention the person responding to the grandparent who gave a cop-out answer (all the most respect to him).
>> Well, let's just say I think I can explain reasoning better than anyone I've encountered. I have my own hypothesized theory of what it is and how it manifests in neural networks.
I'm going to bet you haven't encountered the right people, then. Maybe your social circle is limited to folks like the person who presented a slide about A* to a dumbstruck roomful of Deep Learning researchers at the last NeurIPS?
Possibly; my university doesn't really do AI research beyond using it as a tool to engineer things. I'm looking to transfer to a different university.
But no, my take on reasoning is really a somewhat generalized reframing of the definition of reasoning (which you might find in the Stanford Encyclopedia of Philosophy), reframed partially in the axiomatic building blocks of neural-network components/terminology. I'm not claiming to have discovered reasoning, just to redefine it in a way that's compatible with and sensible for neural networks (ish).
Well, you're free to define and redefine anything as you like, but be aware that every time you move the target closer to your shot, you are setting yourself up for some pretty strong confirmation bias.
Yeah, that's why I need help from the machine-interpretability crowd: to make sure my hypothesized reframing of reasoning has a sufficient empirical basis and isn't adrift in la-la land.
Terribly sorry to be such a tease, but I'm looking to publish a paper on it, and I still need to delve deeper into machine interpretability to make sure it's properly couched empirically. If you can help with that, perhaps we can continue this conversation in private.
I'd like to see this o3 thing play 5D chess with multiverse time travel, or Baba Is You.
The only effect smarter models will have is that intelligent people will have to use less of their brain to do their work. As has always been the case, the medium is the message, and climate change is one of the most difficult and worst problems of our time.
If this gets software people to quit en masse and start working in energy, biology, ecology, and preservation? Then it will have succeeded.
> climate change is one of the most difficult and worst problems of our time.
Slightly surprised to see this view here.
I can think of half a dozen more serious problems offhand (e.g. population aging, institutional scar tissue, dysgenics, nuclear proliferation, pandemic risks, AI itself) along most axes I can think of (raw $ cost, QALYs, even X-risk).
You've been grievously misled if you think climate change could plausibly make the world uninhabitable in the next couple of centuries given current trajectories. I advise going to the primary sources and emailing a climate scientist at your local university for some references.
> going to the primary sources and emailing a climate scientist at your local university for some references
I assume you've done this, otherwise you wouldn't be telling me to? Bold of you to assume my ignorance on this subject. You sound like you've fallen for corporate grifters who care more about short-term profit and gains over long-term sustainability (or you are one of said grifters, in which case why are you wasting your time on HN, shouldn't you be out there grinding?!)
Severe weather events are going to get more common and more devastating over the next couple of decades. They'll come for you and people you care about, just as they come for me and people I care about. It doesn't matter what you think you know about it.
I've read some climate papers but haven't done the email thing (I should, but have not).
The IPCC summaries are a good read too.
Do you genuinely think severe weather events are going to be even amongst the top ten killers this century? If so, I do strongly advise emailing local uni climate scientist. (What's the worst that can happen? Heck, they might confirm your views!)
(In other circumstances I might go through the whole "what have you observed that has given you this belief?" thing, but in this case there is a simple and reliable check in the form of a 5 minute email)
... actually, I can do so on your behalf... would you like me to? The specific questions I would be asking unless told otherwise would be:
1. Probability of human extinction in the next century due to climate change.
2. Probability of more than 10% of human deaths being due to extreme weather.
3. Places to find good unbiased summaries of the likely effects of climate change.
1. Do you think a tornado has a real probability of forming in north-western Europe, where historically there has never been one before? And what do you think the chances are of it being destructive in ways not seen before? (Think the Netherlands, Belgium, Germany, ...)
2. How are the attractors (chaos theory) changing? Is it correct to say that, no, our weather prediction models are not going to get more accurate, and all we can say is that the weather is going to _change_ in all extremes? More intense storms. Colder winters. Hotter summers. Drier droughts.
3. What institution predicted the floods in Spain? Did anyone? Or was this completely unprecedented and a complete surprise?
I don't think that humans will go extinct from climate change, but it will drastically change where we can comfortably live and will uproot our ability to make meaningful cultural and scientific progress.
In your comment above you mention:
> e.g. population aging, institutional scar tissue, dysgenics, nuclear proliferation, pandemic risks, AI itself
These are all intertwined with each other and with climate change. People are less likely to have kids if they don't think those kids will have a comfortable future. Nuclear war is more likely if countries are competing for fewer and fewer resources as we deplete the planet and need to increase food production. Habitat loss from deforestation leads to animals commingling where they normally wouldn't, leading to an increased risk of disease spillover into humans.
You claim that somebody saying "climate change is one of the most difficult and worst problems of our time" is a take you're surprised to see here on HN, but I'm more surprised that you don't list it in what you consider important problems.
You'd be surprised what the AVERAGE human fails to do that you think is easy. My mom can't fucking send an email without downloading a virus; I have a coworker who believes beyond a shadow of a doubt that the world is flat.
The average human is a lot dumber than people on Hacker News and Reddit seem to realize. Shit, the people on MTurk are likely smarter than the AVERAGE person.
Not being able to send an email, or believing the world is flat, is not a sign of intelligence; I'd rather say it's more about culture, or being more or less schooled. Your mom or coworker can still instinctively do stuff that outperforms every algorithm out there, and it is still unexplained how we do it. We still have no idea what intelligence is.
Yet the average human can drive a car a lot better than ChatGPT can, which shows that the way you frame "intelligence" dictates your conclusion about who is "intelligent".
Is the reason that Buffalo is the 81st most populated city in the United States, or 123rd by population density, while Waymo currently serves only approximately 3 cities in North America?
We already let computers control cars because they're better than humans at it when the weather is inclement. It's called ABS.
I would guess you haven't spent much time driving in the winter in the Northeast.
There is an inherent danger to driving in snow and ice. It is a PR nightmare waiting to happen because there is no way around accidents if the cars are on the road all the time in rust belt snow.
I get the feeling that the years I spent in Boston with a car, including during the winter, and driving to Ithaca, somehow aren't enough; but whether or not I have is irrelevant. Still, I'll repeat the advice I was given: before you have to drive in snow, go practice driving in the snow (e.g. in a parking lot) before needing to do so, especially during a storm. Waymo has been spotted doing test drives in Buffalo, so it seems someone gave them similar advice. https://www.wgrz.com/article/tech/waymo-self-driving-car-pro...
There's always an inherent risk to driving, even in sunny Phoenix, AZ. Winter dangers like black ice further multiply that risk, but humans still manage to drive in winter. Taking a picture/video of a snowed-over road, judging its width, and inventing lanes based on that width while accounting for snowbanks doesn't take an ML algorithm. Lidar can see black ice while human eyes cannot, giving cars equipped with lidar (whether driven by a human or a computer) an advantage over those without it, and Waymo cars currently have lidar.
I'm sure there are new challenges for Waymo to solve before deploying the service in Buffalo, but it's not the unforeseen gotcha the parent comment implies.
As for the possible PR nightmare: you'd never do self-driving cars in the first place if you let that fear control you because, as you pointed out, driving on the roads is inherently dangerous, with too many unforeseen complications.
If you take an electrical sensory input signal sequence and transform it into an electrical muscle output signal sequence, you've got a brain. ChatGPT isn't going to drive a car because it's trained on verbal tokens, and it's not optimized for the kind of latency you need for physical interaction.
And the brain doesn't use the same network for verbal reasoning as for real-time coordination, either.
But that work is moving along fine. All of these models and lessons are going to be combined into AGI. It is happening. There isn't really that much in the way.
Maybe, but no doubt these "dumb" people can still get dressed in the morning, navigate a trip to the mall, do the dishes, etc, etc.
It's always been the case that the things that are easiest for humans are hardest for computers, and vice versa. Humans are good at general intelligence - tackling semi-novel problems all day long, while computers are good at narrow problems they can be trained on such as chess or math.
The majority of the benchmarks currently used to evaluate these AI models are narrow skills that the models have been trained to handle well. What will be much more useful is when they are capable of the generality of "dumb" tasks that a human can do.
We can't agree whether Portia spiders are intelligent or just have very advanced instincts. How will we ever agree about what human intelligence is, or how to separate it from cultural knowledge? If that even makes sense.
I guess my point is more: if we can't decide about Portia spiders or chimps, then how can we be so certain about AI? Hence offering up Portia and chimps as counterexamples.
Yeah, I didn't realize chimp studies, or neuroscience, were out of vogue. Even in tech, people form strong 'beliefs' around what they think is happening.
What's interesting is that it might be much closer to human intelligence than to some "alien" intelligence, because after all it is an LLM trained on human-made text, which in a way represents human intelligence.
In that vein, perhaps the delta between o3 @ 87.5% and Human @ 85% represents a deficit in the ability of text to communicate human reasoning.
In other words, it's possible humans can reason better than o3, but cannot articulate that reasoning as well through text - only in our heads, or through some alternative medium.
It's possible humans reason better through text than not through text, so these models, having been trained on text, should be able to out-reason any person who's not currently sitting down to write.
Yeah, this is sort of meaningless without some idea of cost or consequences of a wrong answer. One of the nice things about working with a competent human is being able to tell them "all of our jobs are on the line" and knowing with certainty that they'll come to a good answer.
Agreed. I think what really makes them alien is everything else about them besides intelligence. Namely, no emotional/physiological grounding in empathy, shame, pride, and love (on the positive side) or hatred (negative side).
Human performance is much closer to 100% on this, depending on your human. It's easy to miss the dot in the corner of the headline graph in TFA that says "STEM grad."
It's not made of "steps"; it's an almost continuous function of its inputs. And a function is not an algorithm: it is not an object made of conditions, jumps, terminations, and so on. Obviously it has computational capabilities and is Turing-complete, but it is the opposite of an algorithm.
If it weren't made of steps, Turing machines wouldn't be able to execute it.
Further, this is probably running an algorithm on top of an NN: some kind of tree search.
I get what you’re saying though. You’re trying to draw a distinction between statistical methods and symbolic methods. Someday we will have an algorithm which uses statistical methods that can match human performance on most cognitive tasks, and it won’t look or act like a brain. In some sense that’s disappointing. We can build supersonic jets without fully understanding how birds fly.
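To make that concrete: a hedged sketch of what "a tree search on top of an NN" could look like (we don't know o3's actual method). `sample_steps`, `score`, and `is_solution` are hypothetical model hooks; the search wrapped around them is ordinary discrete control flow:

    import heapq

    # Hypothetical model hooks:
    def sample_steps(state: str, k: int) -> list[str]: ...  # k candidate next steps
    def score(state: str) -> float: ...                     # higher = more promising
    def is_solution(state: str) -> bool: ...

    def best_first(problem: str, k: int = 4, budget: int = 100):
        frontier = [(-score(problem), problem)]
        for _ in range(budget):
            if not frontier:
                break
            _, state = heapq.heappop(frontier)
            if is_solution(state):
                return state
            for step in sample_steps(state, k):
                child = state + "\n" + step
                heapq.heappush(frontier, (-score(child), child))
        return None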
Let's say, rather, that Turing machines can approximate the execution of an NN :) That's why there are issues related to numerical precision. The contrary is also true, of course: NNs can discover and use techniques similar to those of traditional algorithms. However, the two remain two different ways to do computation, and it's probably not just by chance that many things we can't do algorithmically, we can do with NNs. What I mean is that this is not just because NNs discover complex algorithms via gradient descent, but also because the computational model of NNs is better suited to certain tasks. So the inference algorithm of NNs (doing multiplications and other batch transformations) is just what a standard computer needs in order to approximate the NN computational model. You could run it analogically, and then nobody (maybe?) would claim it's running an algorithm. Or that brains themselves are algorithms.
Computers can execute precise computations; it's just not efficient (and it's very slow).
NNs are exactly what "computers" are good for and what we've been using them for since their inception: doing lots of computations quickly.
"Analog neural networks" (brains) work much differently from what are called "neural networks" in computing, and we don't understand their operation well enough to claim they are or aren't algorithmic. But computing NNs are simply implementations of an algorithm.
Edit: upon further rereading, it seems you equate "neural networks" with brain-like operation. But the brain was an inspiration for NNs; they are not an "approximation" of it.
NN inference is an algorithm for computing an approximation of a function with a huge number of parameters. The NN itself is of course just a data structure. But there is nothing whatsoever about the NN process that is non-algorithmic.
It's the exact same thing as using a binary tree to discover the lowest number in some set of numbers, conceptually: you have a data structure that you evaluate using a particular algorithm. The combination of the algorithm and the construction of the data structure arrives at the desired outcome.
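Spelled out as a toy example (plain NumPy, not any particular model): the weight list is the data structure, and inference is a short, perfectly ordinary stepwise procedure that walks it:

    import numpy as np

    def relu(x):
        return np.maximum(x, 0.0)

    # The "data structure": a list of (weights, bias) pairs.
    # The "algorithm": walk it, one layer per step.
    def forward(layers, x):
        for W, b in layers:
            x = relu(W @ x + b)
        return x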
That's not the point, I think: you can implement the brain in BASIC, in theory, but this does not mean the brain is per se a BASIC program.
I'll provide a more theoretical framework for reasoning about this: if the way an NN solves certain problems (the learned weights) can't be translated into a normal program that does NOT resemble the activation of an NN, then NNs are not algorithms but a different computational model.
This may be what they were getting at, but it is still wrong. An NN is a computable function, so NN inference is an algorithm for computing the function the NN represents. If we have an NN that represents a function f, with f(text) = the most likely next character a human would write, then running inference on that NN is an algorithm for finding out which character a human would most likely write next.
It's true that this is not an "enlightening" algorithm, it doesn't help us understand why or how that is the most likely next character. But this doesn't mean it's not an algorithm.
Each layer of the network is like a step, and each token prediction is a repeat of those layers with the previous output fed back into it. So you have steps and a memory.
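Roughly, as a sketch (`next_token` standing in for one full pass through all the layers):

    def next_token(context: list[int]) -> int: ...  # one pass through all layers

    def generate(prompt: list[int], n: int) -> list[int]:
        context = list(prompt)            # the "memory"
        for _ in range(n):                # one step per predicted token
            context.append(next_token(context))
        return context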
"Continuous" would imply infinitely small steps, and as such, would certainly be used as a differentiator (differential? ;) between larger discrete stepped approach.
In essence, infinite calculus provides a link between "steps" and continuous, but those are different things indeed.
Deterministic (IEEE 754 floats), terminates on all inputs, correct (produces loss < X on N training/test inputs).
At most you can argue that there isn't a useful bounded loss on every possible input, but it turns out that humans don't achieve a useful bounded loss on identifying arbitrary sets of pixels as a cat or whatever, either. Most problems NNs are aimed at are qualitative or probabilistic, where provable bounds are less useful than Nth-percentile performance on real-world data.
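The determinism part, at least, is easy to check for yourself: with the same weights and the same input, a forward pass gives bit-identical IEEE 754 outputs on every run (tiny NumPy demo):

    import numpy as np

    rng = np.random.default_rng(0)
    W, x = rng.normal(size=(8, 8)), rng.normal(size=8)
    y1, y2 = np.tanh(W @ x), np.tanh(W @ x)
    assert np.array_equal(y1, y2)  # bit-identical, every run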
How do you define "algorithm"? I suspect it is a definition I would find somewhat unusual. Not to say that I strictly disagree, but only because, to my mind, "neural net" suggests something a bit more concrete than "algorithm", so I might instead say that an artificial neural net is an implementation of an algorithm, or something like that.
But, to my mind, something of the form "Train a neural network with an architecture generally like [blah], with a training method+data like [bleh], and save the result. Then, when inputs are received, run them through the NN in such-and-such way." would constitute an algorithm.
I'll believe it when the AI can earn money on its own. I obviously don't mean someone paying a subscription to use the AI; I mean letting the AI loose on the Internet with only the goal of making money and putting it into a bank account.
You don't think there are already plenty of attempts out there?
When someone is "disinterested enough" to publish, though, note the obvious way to launch a new fund or advisor with a good track record: crank out a pile of them, run them for a year or two, discard the many losers, and publish the one or two top winners. I.e., first you should be suspicious of why it's being published, then of how selected that result is.
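That selection effect is easy to simulate: give a pile of zero-edge "funds" random daily returns, publish only the best one, and the survivor's track record looks like skill:

    import numpy as np

    rng = np.random.default_rng(42)
    # 1000 funds, 500 trading days, zero-edge daily returns.
    returns = rng.normal(loc=0.0, scale=0.01, size=(1000, 500))
    final = (1 + returns).prod(axis=1)
    print("typical fund:  ", round(float(np.median(final)), 2))  # ~1.0x
    print("published fund:", round(float(final.max()), 2))       # the lucky winner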
No, the AI would have to start from zero and reason its way to making money online, like the humans who were first in their online field of interest (e-commerce, scams, ads, etc., from the '80s and '90s), when there was no guidance, only general human intelligence that could reason its way into money-making opportunities and reason its way into making them work.
Which guidance the AI already has stored in spades, even more so since people in the '80s and '90s weren't working with the information available today. The AI is free to research and read all the information stored by other humans as well, just like the humans who reasoned their way into money-making opportunities, only with vastly more information now; talk about an advantage. But is it intelligent enough to do so without a human giving direct, step-by-step instructions, the way humans figure it out?
This is correct. It's easy to get arbitrarily bad results on Mechanical Turk, since without any quality control people will just click as fast as they can to get paid (or bot it and get paid even faster).
So in practice, there's always some kind of quality control. Stricter quality control will improve your results, and the right amount of quality control is subjective. This makes any assessment of human quality meaningless without explanation of how those humans were selected and incentivized. Chollet is careful to provide that, but many posters here are not.
In any case, the ensemble of task-specific, low-compute Kaggle solutions is reportedly also super-Turk, at 81%. I don't think anyone would call that AGI, since it's not general; but if the "(tuned)" in the figure means o3 was tuned specifically for these tasks, that's not obviously general either.