
> I think we’re now way past that now with LLMs now quickly taking on the role of a general reasoning engine.

No we're not, and no they are not.

An LLM doesn't reason, period. It mimics reasoning ability by stochastically choosing a sequence of tokens. A lot of the time these make sense. At other times, they don't make any sense. I recently asked an LLM:

    "Mike leaves the elevator at the 2nd floor. Jenny leaves at the 9th floor. Who left the elevator first?"
It answered correctly that Mike leaves first. Then I asked:

    "If the elevator started at the 10th floor, who would have left first?"
And the answer was that Mike still leaves first, because he leaves at the 2nd floor, and that's the first floor the elevator reaches. Another time I asked an LLM how many footballs fit in a coffee mug, and the conversation reached a point where the AI tried to convince me that coffee mugs are only slightly smaller than the trunk of a car.

Yes, they can also produce the correct answers to both these questions, but the fact that they can also spew such complete illogical nonsense shows that they are not "reasoning" about things. They complete sequences, that's it, period, that's literally the only thing a language model can do.

Their apparent emergent abilities look like reasoning in the same way that Jen from "The IT Crowd" can sound like she's speaking Italian, when in fact she has no idea what she is even saying.




> but the fact that they can also spew such complete illogical nonsense shows that they are not "reasoning" about things

Have you ever seen the proof that 2=1? It looks convincing, but it's illogical because it has a subtle flaw. Are the people who can't spot the flaw just "looking like they are reasoning", when really they lack the ability to reason? Are witnesses who unintentionally make up memories in court cases lacking reasoning? Are children lacking reasoning when you ask them why they drew all over the walls and they make up BS?

You can't just spout that an LLM lacks reasoning without first strictly defining what it means to reason. Everybody keeps going on and on about how an LLM can't possibly be intelligent/reasoning/thinking/sentient etc. All of these are extremely vague and fuzzy words that have no unambiguous definition. Until we can come up with hard metrics that define these terms, nobody is correct when they spout their own nonsense that somehow proves the LLM doesn't fit into their specific definition of fill in the blank.


> Are the people who can't spot the flaw just "looking like they are reasoning", but really they just lack the ability to reason?

Lacking relevant information or insight into a topic isn't the same as lacking the ability to reason.

> You can't just spout that an LLM lacks reasoning without first strictly defining what it means to reason.

Perfectly worded definition available on Wikipedia:

    Reason is the capacity of consciously applying logic by drawing conclusions from new or existing information, with the aim of seeking the truth.
"Consciously", "logic", and "seeking the truth" are the operative terms here. A sequence predictor does none of that. Looking at my above example: The sequence "Mike leaves the elevator first" isn't based on logical thought, or a conscious abstraction of the world built from ingesting the question. It's based on the fact that this sequence has statistically a higher chance to appear after the sequence representing the question.

How does our reasoning work? How do humans answer such a question? By building an abstract representation of the world based on the meaning of the words in the question. We can imagine Mike and Jenny in that elevator, we can imagine the elevator moving, floor numbers have meaning in the environment, and we understand what "something is higher up" means. From all this we build a model and draw conclusions.

How does the "reasoning" in the LLM work? It checks which tokens are likely to appear after another sequence of tokens. It does so by having learned how we like to build sequences of tokens in our language. That's it. There is no modeling of the situation going on, just stochastic analysis of a sequence.
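As a rough illustration of what "checking which tokens are likely to appear" means, here is a minimal Python sketch of picking the next token from a probability table. The table `NEXT_TOKEN_PROBS` and its numbers are invented for the elevator prompt and are not taken from any real model; a real LLM computes such a distribution with a neural network over a vocabulary of many thousands of tokens.

```python
import random

# Hypothetical next-token probabilities after the elevator question.
# The values are made up for illustration only.
NEXT_TOKEN_PROBS = {
    "Mike": 0.70,
    "Jenny": 0.25,
    "Nobody": 0.05,
}


def sample_next_token(probs, rng):
    """Pick one token at random, weighted by its probability."""
    tokens = list(probs)
    weights = [probs[token] for token in tokens]
    return rng.choices(tokens, weights=weights, k=1)[0]


def greedy_next_token(probs):
    """Always pick the single most probable token."""
    return max(probs, key=probs.get)


rng = random.Random(0)  # fixed seed so repeated runs give the same draws
samples = [sample_next_token(NEXT_TOKEN_PROBS, rng) for _ in range(1000)]
counts = {token: samples.count(token) for token in NEXT_TOKEN_PROBS}

print("greedy choice:", greedy_next_token(NEXT_TOKEN_PROBS))
print("sampled counts:", counts)
```

Nothing in this table records whether a token is true; the less likely tokens still get picked some of the time simply because they carry nonzero weight.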

Consequently, an LLM cannot "seek truth" either. If a sequence has a high chance of appearing in a position, it doesn't matter if it is factually true or not, or even logically sound. The model isn't trained on "true or false". It will likely say things that are true more often than not, not because it understands truth, but because the training data contain a lot of token sequences that, when interpreted by a human mind, state true things.

Lastly, imagine trying to apply a language model to an area that depends completely on the above definition of reasoning as a consequence of modeling the world based on observations and drawing new conclusions from that modeling.

https://www.spiceworks.com/tech/artificial-intelligence/news...


You must have missed the part where I said:

> Until we can come up with hard metrics that define these terms, nobody is correct when they spout their own nonsense that somehow proves the LLM doesn't fit into their specific definition of fill in the blank.

"Consciously", "logic", and "seeking the truth" are not objectively verifiable metrics of any kind.

I'll repeat what I said: Until we come up with hard metrics that define these terms, nobody can be correct. I'll take Investopedia's definition of what a metric is, as that most succinctly embodies the idea I was getting at:

> Metrics are measures of quantitative assessment commonly used for assessing, comparing, and tracking performance or production.[0]

So, until we can quantitatively assess how an LLM performs compared to a human in "consciousness", "logic", and "seeking the truth", whatever ambiguous definition you throw out there will not confirm or deny whether an LLM embodies these traits as opposed to a human embodying these traits.

[0]: https://www.investopedia.com/terms/m/metrics.asp


To elaborate a bit on my own post here:

The sequence "Mike leaves the elevator first" has a high statistical probability. The sequence "Jenny leaves the elevator first" has a lower probability than that. But it probably still has a much higher probability than "Michael is standing on the Moon", which in turn may be more likely than "Car dogfood sunshine Javascript", which is still probably more likely than "snglub dugzuvutz gummmbr ha tcha ding dong".

Note that none of these sequences are wrong in the world of a language model. They are just increasingly unlikely to occur in that position. To us, with our ability to reason by logically drawing conclusions from an abstract internal model of the world, all these other sequences represent either false statements or nonsensical word salad.
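The ranking idea can be sketched with a toy model. The Python below is an illustration under loud assumptions: the three-sentence `CORPUS`, the interpolation weight `LAMBDA`, and the simple bigram/unigram mix are all invented here and are far cruder than anything a real LLM does. It only shows that every sequence gets some score, and that the score drops as the sequence looks less like the training text.

```python
import math
from collections import Counter

# Tiny invented corpus; a real model is trained on vastly more text.
CORPUS = [
    "mike leaves the elevator first",
    "mike leaves the elevator first",
    "jenny leaves the elevator first",
]
START, UNK = "<s>", "<unk>"
LAMBDA = 0.7  # hypothetical weight of the bigram estimate vs. the unigram one

unigrams = Counter()   # how often each word occurs
bigrams = Counter()    # how often each (previous, word) pair occurs
contexts = Counter()   # how often each word occurs as a "previous" word
for sentence in CORPUS:
    prev = START
    for word in sentence.split():
        unigrams[word] += 1
        bigrams[(prev, word)] += 1
        contexts[prev] += 1
        prev = word

VOCAB_SIZE = len(unigrams) + 1  # +1 for the unknown-word bucket
TOTAL = sum(unigrams.values())


def token_prob(prev, word):
    """Interpolate a bigram estimate with an add-one smoothed unigram estimate."""
    if word not in unigrams:
        word = UNK
    unigram = (unigrams[word] + 1) / (TOTAL + VOCAB_SIZE)
    bigram = bigrams[(prev, word)] / contexts[prev] if contexts[prev] else 0.0
    return LAMBDA * bigram + (1 - LAMBDA) * unigram


def sequence_log_prob(text):
    """Sum of log-probabilities of each word given the word before it."""
    prev, total = START, 0.0
    for word in text.lower().split():
        total += math.log(token_prob(prev, word))
        prev = word
    return total


CANDIDATES = [
    "Mike leaves the elevator first",
    "Jenny leaves the elevator first",
    "elevator the first leaves Mike",
    "snglub dugzuvutz gummmbr ha tcha ding dong",
]
scores = [sequence_log_prob(candidate) for candidate in CANDIDATES]
for candidate, score in zip(CANDIDATES, scores):
    print(f"{score:8.2f}  {candidate}")
```

In this toy setup no candidate is rejected as false or meaningless; the shuffled sentence and the gibberish just receive lower log-probabilities than the two sentences that resemble the corpus.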


GPT-4 reasons correctly about a lot of gotcha logic puzzles. Any pre-GPT-4 opinions should be reconsidered; these are effectively two different epochs in the history of AI effectiveness and reasoning.

> Me: Mike leaves the elevator at the 2nd floor. Jenny leaves at the 9th floor. Who left the elevator first?

> GPT-4: Mike left the elevator first, as he got off at the 2nd floor, while Jenny left at the 9th floor.

> Me: If the elevator started at the 10th floor, who would have left first?

> GPT-4: If the elevator started at the 10th floor and went downward, then Jenny would have left first, as she got off at the 9th floor, while Mike left at the 2nd floor.

> Me: How many footballs fit in a coffe-mug?

> GPT-4: A standard football (soccer ball) has a diameter of around 22 centimeters (8.65 inches), while a coffee mug is typically much smaller, with a diameter of around 8-10 centimeters (3-4 inches). Therefore, it is not possible to fit a standard football inside a coffee mug. If you were to use a mini football or a much larger mug, the number of footballs that could fit would depend on the specific sizes of the footballs and the mug.

It easily answered all of your questions and produced explanations I would expect most reasonable people to give.


That changes exactly nothing about the validity of my statement.

Yes, GPT-4 is better at this mimicry than GPT-3 or GPT-3.5. And GPT-3 was better at it than GPT-2. And all of them were better than my just-for-fun, home-built language model projects trained on small (<10 GiB) datasets, which in turn were better at it than my PoC models trained on just a few thousand words.

But being better at mimicking reason, is still not reasoning. The model doesn't know what a coffee mug is, and it doesn't know what a football is. It also has no idea how elevators work. It can form sequences that make it look to us as if it does and knows all these things, but in reality, it only knows that "then Jenny would have left first" is a more likely sequence of tokens at that point, given that the sequence before included "started at the 10th floor".

Bear in mind, this doesn't mean that this mimicry isn't useful. It is, tremendously so. I don't care how I get correct answers, I only care that I do.


> But being better at mimicking reason, is still not reasoning

How do I know people are not using a similar process when they perform "reasoning" but with a way more elaborate model?

Can you prove to me that the two are inherently different in the type of output they produce, regardless of how large an ML model is or can be?

Because if you can't, and they produce the same type of output, the processing could be similar enough to be considered reasoning.


> but with a way more elaborate model?

Simple: I know that humans have intentionality and agency. They want things, they have goals both immediate and long term. Their replies are based not just on the context of their experiences and the conversation but their emotional and physical state, and the applicability of their reply to their goals.

And they are capable of reasoning about topics for which they have no prior information, by applying reasonable similarities. Example: Even if someone has never heard the phrase "walking a mile in someone else's shoes", most humans (provided they speak English) have no difficulty in figuring out what it means. They also have no trouble figuring out that this is a figure of speech, and not a literal action.


> Simple: I know that humans have intentionality and agency. They want things, they have goals both immediate and long term. Their replies are based not just on the context of their experiences and the conversation but their emotional and physical state, and the applicability of their reply to their goals.

This all seems orthogonal to reasoning, but also who is to say that somewhere in those billions of parameters there isn't something like a model of goals and emotional state? I mean, I seriously doubt it, but I also don't think I could evidence that.


> but also who is to say that somewhere in those billions of parameters there isn't something like a model of goals and emotional state?

No one, but as is well established, absence of proof of nonexistence isn't an argument for existence. https://en.wikipedia.org/wiki/Russell's_teapot


Correct, but the problem is that the way you prove that for humans is by looking at their output and inferring it. You can apply the same criteria to ML models. If you don't, you need some other criteria to rule out that assumption for ML models.


For humans I can simply refer to my own internal state and look at how I arrive at conclusions.

I am of course aware that this is essentially a form of ipse dixit, but I will do it anyway in this case, because I am saying it as a human, about humans, and to other humans, and so the audience can just try it for themselves.


> I know that humans have intentionality and agency.

You assume that. You can only maybe know that about yourself. But my question was a bit different. How do you know that the ML model doesn't?

> about topics for which they have no prior information, by applying reasonable similarities.

This is a contradiction. If you have no prior information about a topic, you can't even know which topic is similar.

> Even if someone never heard the phrase "walking a mile in someone else's shoes".

Same for ML models. They don't have a representation of every possible prompt.


> You assume that. You can only maybe know that about yourself.

I can also only say with certainty that planetary gravity is an attracting force on the very spot I am standing on. I haven't visited every spot on every planet in the universe after all.

That doesn't make it any more likely that my extrapolation of how gravity works here is wrong somewhere else. Russell's teapot works both ways.

> How do you know that the ML model doesn't?

For the same reason I know that a hammer or an operating system doesn't: I know how they work. Not in the most minute detail, and of course the actual model is essentially a black box, but its architecture and MO are not.

It completes sequences. That is all it does. It has no semantic understanding of the things these sequences represent. It has no understanding of true or false. It doesn't know math, it doesn't know who person xyz is, it doesn't know that 1993 already happened and 2221 did not. It cannot have abstract concepts of the things represented by the sequences, because the sequences are the things in its world.

It knows that a sequence is more or less likely to follow another sequence. That's it.

From that limited knowledge, however, it can very successfully mimic things like math, logic, and even reasoning to an extent. And it can mimic them well enough to be useful in a lot of areas.

But that mimicry, however useful, is still mimicry. It's still the Chinese Room thought experiment.


I agree about the utility part.

However, I don't really accept the idea that this isn't reasoning, but I'm not entirely sold either way.

I'd say if it mimics something well enough then eventually it's just doing the thing, which is the same side of the argument I fall on with Searle's Chinese Room Argument. If you can't discern a difference, is there a difference?

So far GPT-4 can produce better work than like 50% of humans, and better responses to brain-teaser questions than most of them too. Or at least I'm just in a bubble and so I don't run into people that stupid that often, which makes it easier for me to still see the gaps.


> I'd say if it mimics something well enough then eventually it's just doing the thing

Right up to the point where it actually needs to reason, and the mimicry doesn't suffice.

My above example about the football and the coffee mug is an easy one; the objects are well represented in its training data. What if I need to know why the service ping spikes every 60 seconds, and I say: here is the code, please, LLM, look into it? I am sure I will get a great and well-written answer.

I am also sure it won't be the correct one, which is that some dumb script I wrote, which has nothing to do with the code shown, blocks the server for about 700ms every minute.

Figuring out that something cannot be explained with the data represented, and thus may come from a source unseen, is one example of actual reasoning. And this "giving up on the data shown" is something I have yet to see any AI do.


I could say the same about most second-rate software engineers. That's why I'm not moved by your arguments. There are plenty of people just as stupid who will give you confidently wrong answers.



