How do you know that they don’t do these things? Seems hard to say for sure sinc...

FuckButtons · 2024-10-31T01:44:00 1730339040

Absence of evidence or a simple explanation does not mean that you can imbue statistical regression with animal spirits.

toasterlovin · 2024-10-31T02:46:33 1730342793

The burden of proof goes both ways: if you want to say X isn’t really the same thing as human general intelligence, you have to be able to confidently say human general intelligence isn’t really the same thing as X.

beardedwizard · 2024-10-31T13:02:06 1730379726

An interesting mental trap, except that the indirect evidence keeps mounting that LLMs do not possess human general intelligence, even if we cannot describe exactly how it exists in the brain.

toasterlovin · 2024-10-31T16:30:11 1730392211

On the contrary, the parallels between the peculiarities of LLMs and various aspects of human cognition seem very striking to me. Given how early we are in figuring out what we can accomplish with LLMs, IMO the appropriate epistemic stance is to not reach any unequivocal conclusions. And then my personal hunch is that LLMs may be most of the magic, with how they're orchestrated and manipulated being the remainder (which may take a very long time to figure out).

TZubiri · 2024-11-01T20:17:29 1730492249

I think it's just that I understand LLMs better than you, and I know that they are very different from human intelligence. Here's a couple of differences:

- LLMs use fixed resources when computing an answer. And to the extent that they don't, they are function calling and the behaviour is not attributable to the LLMs. For example when using a calculator, it is displaying calculator intelligence.

- LLMs do not have memory, and if they do it's very recent and limited, and distinct from any being so far. They don't remember what you said 4 weeks ago, and they don't incorporate that into their future behaviour. And if they do, the way they train and remember is very distinct from that of humans and relates to it being a system being offered as a free service to multiple users. Again to the extent that they are capable of remembering, their properties are not that of LLMs and are attributable to another layer called via function calling.

LLMs are a perception layer for language, and perhaps for output generation, but they are not the intelligence.

broast · 2024-10-31T12:34:35 1730378075

Are you not imbuing animals with spirits based on lack of evidence of statistical regression?

_m9r2 · 2024-10-31T00:04:34 1730333074

If you give an LLM a word problem that involves the same math and change the names of the people in the word problem, the LLM will likely generate different mathematical results. Without any knowledge of how any of this works, that seems like pretty damning evidence that LLMs do not reason. They are predictive text models. That’s it.
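For anyone who wants to check this claim rather than argue about it, here is a minimal sketch of a name-perturbation test. The template, `make_variants`, `consistency`, and the `ask_model` callable are all hypothetical names made up for illustration; `toy_model` is a stand-in that just sums the numbers in the prompt, and you would swap in a call to whatever model you want to test.

```python
import re

# Hypothetical word-problem template; only the names change between variants.
TEMPLATE = (
    "{a} has {x} apples. {b} gives {a} {y} more apples. "
    "How many apples does {a} have now?"
)


def make_variants(x, y, name_pairs):
    """Return (prompt, expected_answer) pairs that differ only in the names."""
    return [(TEMPLATE.format(a=a, b=b, x=x, y=y), x + y) for a, b in name_pairs]


def consistency(ask_model, variants):
    """Fraction of variants for which ask_model returns the expected answer."""
    correct = sum(1 for prompt, expected in variants if ask_model(prompt) == expected)
    return correct / len(variants)


def toy_model(prompt):
    """Stand-in for a real model (assumption): adds up every integer in the prompt."""
    return sum(int(n) for n in re.findall(r"\d+", prompt))


if __name__ == "__main__":
    variants = make_variants(3, 4, [("Alice", "Bob"), ("Matthew", "Mary")])
    print(consistency(toy_model, variants))
```

A score that drops when only the names change would support the claim; a stable score across many name pairs would count against it.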

alexwebb2 · 2024-10-31T00:10:10 1730333410

Demonstrably false.

https://chatgpt.com/share/6722ca8a-6c80-800d-89b9-be40874c5b...

https://chatgpt.com/share/6722ca97-4974-800d-99c2-bb58c60ea6...

TZubiri · 2024-10-31T00:53:45 1730336025

It's worth noting that this may not be the result of a pure LLM. It's possible that ChatGPT is using "actions", explicitly:

1- Run the query through a classifier to figure out if the question involves numbers or math

2- Extract the function and the operands

3- Do the math operation with standard non-LLM mechanisms

4- Feed back the solution to the LLM

5- Concatenate the math answer with the LLM answer with string substitution

So in a strict sense this is not very representative of the logical capabilities of an LLM.
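The steps above can be sketched roughly as follows. This is only an illustration of the hypothesised pipeline, not how ChatGPT actually works: the regex "classifier", the `{result}` placeholder convention, and the `llm` callable are all assumptions made up for the example.

```python
import operator
import re

OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul, "/": operator.truediv}
# Matches a single binary expression such as "12 * 7".
EXPR = re.compile(r"(-?\d+(?:\.\d+)?)\s*([+\-*/])\s*(-?\d+(?:\.\d+)?)")


def looks_like_math(query):
    """Step 1: crude stand-in for a classifier (assumption: a regex is enough)."""
    return EXPR.search(query) is not None


def extract(query):
    """Step 2: pull out the operator and the two operands."""
    left, op, right = EXPR.search(query).groups()
    return op, float(left), float(right)


def answer(query, llm):
    """Run the five-step pipeline; llm is any callable mapping prompt -> text."""
    if not looks_like_math(query):
        return llm(query)
    op, left, right = extract(query)
    result = OPS[op](left, right)  # Step 3: plain non-LLM arithmetic
    # Step 4: ask the LLM to phrase the reply around a placeholder.
    draft = llm(query + " Use the placeholder {result} for the numeric answer.")
    # Step 5: string substitution of the computed value.
    return draft.replace("{result}", f"{result:g}")


if __name__ == "__main__":
    fake_llm = lambda prompt: "The answer is {result}."
    print(answer("What is 12 * 7?", fake_llm))
```

Division by zero and multi-step expressions are deliberately left unhandled to keep the sketch short.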

digging · 2024-10-31T15:06:54 1730387214

Then what's the point of ever talking about LLM capabilities again? We've already hooked them up to other tools.

This confusion was introduced at the top of the thread. If the argument is "LLMs plus tooling can't do X," the argument is wrong. If the argument is "LLMs alone can't do X," the argument is worthless. In fact, if the argument is that binary at all, it's a bad argument and we should laugh it out of the room; the idea that a lay person uninvolved with LLM research or development could make such an assertion is absurd.

thomashop · 2024-10-31T04:04:01 1730347441

It shows you when it's calling functions. I also did the same test with Llama, which runs locally and cannot make function calls, and it works.

TZubiri · 2024-10-31T06:27:42 1730356062

You are right. I actually downloaded Llama to do more detailed tests. God bless Stallman.

astrange · 2024-10-31T08:25:57 1730363157

Minor edits to well-known problems do easily fool current models though. Here's one 4o and o1-mini fail on, but o1-preview passes. (It's the mother/surgeon riddle so kinda gore-y.)

https://chatgpt.com/share/6723477e-6e38-8000-8b7e-73a3abb652...

https://chatgpt.com/share/6723478c-1e08-8000-adda-3a378029b4...

https://chatgpt.com/share/67234772-0ebc-8000-a54a-b597be3a1f...

_flux · 2024-10-31T08:55:17 1730364917

I think you didn't use the "share" function; I cannot open any of these links. Can you do it in a private browser session (so you're not logged in)?

astrange · 2024-10-31T09:03:03 1730365383

Oops, fixed the links.

mini's answer is correct, but then it forgets that fathers are male in the next sentence.

SequoiaHope · 2024-10-31T02:01:33 1730340093

At this point I really only take rigorous research papers into account when considering this stuff. Apple published research just this month that the parent post is referring to. A systematic study is far more compelling than an anecdote.

https://machinelearning.apple.com/research/gsm-symbolic

og_kalu · 2024-10-31T02:11:24 1730340684

That study shows 4o, o1-mini and o1-preview's new scores are all within margin of error on 4/5 of their new benchmarks (some even see increases). The one that isn't involves changing more than names.

Changing names does not affect the performance of Sota models.

gruez · 2024-10-31T02:18:55 1730341135

>That study very clearly shows 4o, o1-mini and o1-preview's new scores are all within margin of error on 4/5 of their new benchmarks.

Which figure are you referring to? For instance figure 8a shows a -32.0% accuracy drop when an insignificant change was added to the question. It's unclear how that's "within the margin of error" or "Changing names does not affect the performance of Sota models".

og_kalu · 2024-10-31T02:29:31 1730341771

Table 1 in the Appendix. GSM-No-op is the one benchmark that sees significant drops for those 4 models as well (with preview dropping the least at -17%). No-op adds "seemingly relevant but ultimately inconsequential statements". So "change names, performance drops" is decidedly false for today's state of the art.

gruez · 2024-10-31T02:39:37 1730342377

Thanks. I wrongly focused on the headline result of the paper rather than the specific claim in the comment chain about "changing name, different results".

SequoiaHope · 2024-10-31T07:24:10 1730359450

Ah, that’s a good point thanks for the correction.

zmgsabst · 2024-10-31T14:47:56 1730386076

Only if there isn’t a systemic fault, e.g. bad prompting.

Their errors appear to disappear when you correctly set the context from conversational to adversarial testing — and Apple is actually testing the social context and not its ability to reason.

I’m just waiting for Apple to release their GSM-NoOp dataset to validate that; preliminary testing shows it’s the case, but we’d prefer to use the same dataset so it’s an apples-to-apples comparison. (They claim it will be released “soon”.)

gruez · 2024-10-31T02:12:32 1730340752

To be fair, the claim wasn't that it always produced the wrong answer, just that there exist circumstances where it does. A pair of examples where it was correct hardly justifies a "demonstrably false" response.

thomashop · 2024-10-31T07:30:28 1730359828

Conversely, a pair of examples where it was incorrect hardly justifies the opposite response.

If you want a more scientific answer there is this recent paper: https://machinelearning.apple.com/research/gsm-symbolic

EraYaN · 2024-10-31T10:28:38 1730370518

It kind of does though, because it means you can never trust the output to be correct. The error is a much bigger deal than it being correct in a specific case.

thomashop · 2024-10-31T14:10:35 1730383835

You can never trust the outputs of humans to be correct but we find ways of verifying and correcting mistakes. The same extra layer is needed for LLMs.

digging · 2024-10-31T15:09:18 1730387358

> It kind of does though, because it means you can never trust the output to be correct.

Maybe some HN commenters will finally learn the value of uncertainty then.

jklinger410 · 2024-10-31T00:28:28 1730334508

This is the kind of comment you make when your experience with LLMs is through memes.

Workaccount2 · 2024-10-31T00:21:45 1730334105

This is a relatively trivial task for current top models.

More challenging are unconventional story structures, like a mom named Matthew with a son named Mary and a daughter named William: who is Matthew's daughter?

But even these can still be done by the best models. And it is very unlikely there is much if any training data that's like this.

alexwebb2 · 2024-10-31T00:38:33 1730335113

That's a neat example problem, thanks for sharing!

For anyone curious: https://chatgpt.com/share/6722d130-8ce4-800d-bf7e-c1891dfdf7...

> Based on traditional naming conventions, it seems that the names might have been switched in this scenario. However, based purely on your setup:

>

> Matthew has a daughter named William and a son named Mary.

>

> So, Matthew's daughter is William.

rileymat2 · 2024-10-31T03:15:43 1730344543

How do people fare on unconventional structures? I am reminded of that old riddle involving the mother being the doctor after a car crash.

adwn · 2024-10-31T07:31:04 1730359864

No idea why you've been downvoted, because that's a relevant and true comment. A more complex example would be the Monty Hall problem [1], for which even some very intelligent people will intuitively give the wrong answer, whereas symbolic reasoning (or Monte Carlo simulations) leads to the right conclusion.

[1] https://en.wikipedia.org/wiki/Monty_Hall_problem
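For reference, the Monte Carlo approach mentioned above takes only a few lines. This is a minimal sketch of the standard version of the puzzle (three doors, host always opens a goat door that is not your pick); the function names and the fixed seed are choices made for the example.

```python
import random


def play(switch, rng):
    """Play one round; return True if the contestant ends up with the car."""
    doors = [0, 1, 2]
    car = rng.choice(doors)
    pick = rng.choice(doors)
    # The host opens a door that is neither the contestant's pick nor the car.
    opened = rng.choice([d for d in doors if d != pick and d != car])
    if switch:
        # Move to the one remaining closed door.
        pick = next(d for d in doors if d != pick and d != opened)
    return pick == car


def win_rate(switch, trials=100_000, seed=0):
    """Estimate the probability of winning with or without switching."""
    rng = random.Random(seed)
    return sum(play(switch, rng) for _ in range(trials)) / trials


if __name__ == "__main__":
    print("switch:", win_rate(True))
    print("stay:  ", win_rate(False))
```

Switching should come out near 2/3 and staying near 1/3, which is the result that intuition tends to get wrong.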

vanviegen · 2024-10-31T07:34:15 1730360055

And yet, humans, our benchmark for AGI, suffer from similar problems, with our reasoning being heavily influenced by things that should have been unrelated.

https://en.m.wikipedia.org/wiki/Priming_(psychology)

_heimdall · 2024-10-31T02:50:04 1730343004

The whole design of an LLM is to consume and compress a huge space of human-generated content and use that to predict how a human would reply, one token at a time. That alone means the LLM isn't modelling anything beyond the human content it was trained on, and there is no reasoning since every prediction is based only on probabilities combined with controls similar to randomization factors used to avoid an entirely deterministic algorithm.
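The "probabilities combined with randomization factors" part is usually implemented as temperature sampling over the model's output scores. A minimal sketch, assuming the model has already produced a list of logits (one per candidate token); the function names are made up for the example and real decoders typically add further controls such as top-k or top-p.

```python
import math
import random


def softmax(logits, temperature=1.0):
    """Turn raw scores into probabilities; lower temperature sharpens them."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample_token(logits, temperature=1.0, rng=random):
    """Pick a token index; temperature 0 means greedy (fully deterministic)."""
    if temperature == 0:
        return max(range(len(logits)), key=logits.__getitem__)
    probs = softmax(logits, temperature)
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]


if __name__ == "__main__":
    logits = [0.1, 5.0, 0.3]
    print(sample_token(logits, temperature=0))    # always the highest-scoring token
    print(sample_token(logits, temperature=1.5))  # may vary from run to run
```

Note that this only describes how a token is chosen from the scores; it says nothing either way about what computation produced the scores.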

ricardobeat · 2024-10-31T03:20:22 1730344822

That’s not an accurate description. Attention / multi-head attention mechanisms allow the model to understand relationships between words far apart and their context.

They still lack, as far as we know, a world model, but the results are already eerily similar to how most humans seem to think - a lot of our own behaviour can be described as “predict how another human would reply”.
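For readers unfamiliar with the attention mechanism mentioned above, here is a minimal single-head sketch of scaled dot-product attention in plain Python. It leaves out everything a real transformer adds (learned projections, multiple heads, masking, batching), and the function names are made up for the example.

```python
import math


def softmax(xs):
    """Normalise scores into weights that sum to 1."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Scaled dot-product attention over lists of equal-length vectors."""
    d = len(keys[0])
    out = []
    for q in queries:
        # Similarity of this query to every key, regardless of position distance.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        # Output is the weighted average of the value vectors.
        out.append(
            [sum(w * v[j] for w, v in zip(weights, values)) for j in range(len(values[0]))]
        )
    return out


if __name__ == "__main__":
    keys = [[10.0, 0.0], [0.0, 10.0]]
    values = [[1.0], [0.0]]
    print(attention([[10.0, 0.0]], keys, values))  # dominated by the first value
```

Because every query is scored against every key, a token can draw on context from anywhere in the sequence, which is the "relationships between words far apart" point.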

thomashop · 2024-10-31T07:27:30 1730359650

When trained on simple logs of Othello's moves, the model learns an internal representation of the board and its pieces. It also models the strength of its opponent.

https://arxiv.org/abs/2210.13382

I'd be more surprised if LLMs trained on human conversations don't create any world models. Having a world model simply allows the LLM to become better at sequence prediction. No magic needed.

There was another recent paper that shows that a language model is modelling things like the age, gender, etc., of its conversation partner without having been explicitly trained for it.

_heimdall · 2024-10-31T12:14:23 1730376863

Do we know for a fact that the mechanisms are actually used that way inside the model?

My understanding was that they know how the model was designed to be able to work, but that there's been very little (no?) progress in the black box problem so we really don't know much at all about what actually happens internally.

Without better understanding of what actually happens when an LLM generates an answer I stick with the most basic answer that it's simply predicting what a human would say. I could be wildly misinformed there, I don't work directly in the space and it's been moving faster than I'm interested in keeping up with.

ChadNauseam · 2024-10-31T03:43:11 1730346191

For a lot of the content they were trained on, it seems like the easiest way to predict the next token would be to model the world or work with axioms. So how do we know that an LLM isn't doing these things internally?

thomashop · 2024-10-31T04:00:55 1730347255

In fact, it looks like the model is doing those things internally.

  We argue that representations in AI models, particularly deep networks, are converging. First, we survey many examples of convergence in the literature: over time and across multiple domains, the ways by which different neural networks represent data are becoming more aligned. Next, we demonstrate convergence across data modalities: as vision models and language models get larger, they measure distance between datapoints in a more and more alike way. We hypothesize that this convergence is driving toward a shared statistical model of reality, akin to Plato’s concept of an ideal reality. We term such a representation the platonic representation and discuss several possible selective pressures toward it. Finally, we discuss the implications of these trends, their limitations, and counterexamples to our analysis.

https://arxiv.org/html/2405.07987v5

_heimdall · 2024-10-31T13:30:08 1730381408

Unless I misread this paper, their argument is entirely hypothetical, meaning that the LLM is still a black box and they can only hypothesise what is going on internally by viewing the output(s) and guessing at what it would take to get there.

There's nothing wrong with a hypothesis or that process, but it means we still don't know whether models are doing this or not.

thomashop · 2024-10-31T14:30:24 1730385024

Maybe I mixed up that paper with another but the one I meant to post shows that you can read something like a world model from the activations of the layers.

There was a paper that shows a model trained on Othello moves creates a model of the board, models the skill level of their opponent and more.

_heimdall · 2024-10-31T12:15:32 1730376932

Well my understanding is that there's ultimately the black box problem. We keep building these models and the output seems to get better, but we can't actually inspect how they work internally.

wg0 · 2024-10-31T13:27:07 1730381227

How do we know Santa doesn't exist? Maybe he does.