
So, LLMs face a regression on their latest proposed improvement. It's not surprising considering their functional requirements are:

1) Everything

For the purpose of AGI, LLMs are starting to look like a local maximum.




> For the purpose of AGI, LLMs are starting to look like a local maximum.

I've been saying it since they started popping off last year and everyone was getting euphoric about them. I'm basically a layman - a pretty good programmer and software engineer, and took a statistics and AI class 13 years ago in university. That said, it just seems so extremely obvious to me that these things are likely not the way to AGI. They're not reasoning systems. They don't work with axioms. They don't model reality. They don't really do anything. They just generate stochastic output from the probabilities of symbols appearing in a particular order in a given corpus.

It continues to astound me how much money is being dumped into these things.


How do you know that they don’t do these things? Seems hard to say for sure since it’s hard to explain in human terms what a neural network is doing.


Absence of evidence or a simple explanation does not mean that you can imbue statistical regression with animal spirits.


The burden of proof goes both ways: if you want to say X isn’t really the same thing as human general intelligence, you have to be able to confidently say human general intelligence isn’t really the same thing as X.


An interesting mental trap, except that the indirect evidence keeps mounting that LLMs do not possess human general intelligence, even if we cannot describe exactly how it exists in the brain.


On the contrary, the parallels between the peculiarities of LLMs and various aspects of human cognition seem very striking to me. Given how early we are in figuring out what we can accomplish with LLMs, IMO the appropriate epistemic stance is to not reach any unequivocal conclusions. And then my personal hunch is that LLMs may be most of the magic, with how they're orchestrated and manipulated being the remainder (which may take a very long time to figure out).


I think it's just that I understand LLMs better than you, and I know that they are very different from human intelligence. Here are a couple of differences:

- LLMs use fixed resources when computing an answer. To the extent that they don't, they are function calling, and that behaviour is not attributable to the LLM. For example, when it uses a calculator, it is displaying the calculator's intelligence.

- LLMs do not have memory, and to the extent that they do, it is very recent, limited, and distinct from that of any being so far. They don't remember what you said 4 weeks ago and don't incorporate that into their future behaviour. And where they do, the way they train and remember is very different from how humans do, and reflects their being a system offered as a free service to multiple users. Again, to the extent that they are capable of remembering, those properties are not properties of the LLM and are attributable to another layer called via function calling.

LLMs are a perception layer for language, and perhaps for output generation, but they are not the intelligence.


Are you not imbuing animals with spirits based on a lack of evidence of statistical regression?


If you give an LLM a word problem and then change only the names of the people in it, keeping the math the same, the LLM will likely generate different mathematical results. Without any knowledge of how any of this works, that seems like pretty damning evidence that LLMs do not reason. They are predictive text models. That's it.
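A rough sketch of that kind of perturbation check (ask_llm is a hypothetical stand-in for whatever model is being tested, and the problem template is made up for illustration):

  import re

  def ask_llm(prompt: str) -> str:
      # Hypothetical wrapper around whatever model/API is being tested.
      raise NotImplementedError("plug in your model call here")

  TEMPLATE = ("{a} has 5 apples. {b} gives {a} 3 more apples, then {a} eats 2. "
              "How many apples does {a} have? Answer with a single number.")

  def numeric_answer(a: str, b: str):
      reply = ask_llm(TEMPLATE.format(a=a, b=b))
      match = re.search(r"-?\d+", reply)
      return int(match.group()) if match else None

  # Same arithmetic, different names; a system that reasons should be invariant.
  baseline = numeric_answer("Alice", "Bob")
  for a, b in [("Priya", "Chen"), ("Olu", "Marta"), ("Sven", "Aiko")]:
      print(a, b, numeric_answer(a, b) == baseline)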



It's worth noting that this may not be the result of a pure LLM; it's possible that ChatGPT is using "actions", explicitly:

1- Run the query through a classifier to figure out whether the question involves numbers or math
2- Extract the function and the operands
3- Do the math operation with standard non-LLM mechanisms
4- Feed the solution back to the LLM
5- Concatenate the math answer with the LLM answer via string substitution

So in a strict sense this is not very representative of the logical capabilities of an LLM.
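For illustration only, one possible shape of such an "actions" pipeline; ask_llm and the crude digit-based classifier below are hypothetical stand-ins, not ChatGPT's actual implementation:

  import ast
  import operator

  def ask_llm(prompt: str) -> str:
      # Hypothetical stand-in for the language model call.
      raise NotImplementedError("plug in your model call here")

  def safe_eval(expr: str) -> float:
      # 3) do the arithmetic with ordinary, non-LLM code
      ops = {ast.Add: operator.add, ast.Sub: operator.sub,
             ast.Mult: operator.mul, ast.Div: operator.truediv}
      def ev(node):
          if isinstance(node, ast.Constant):
              return node.value
          if isinstance(node, ast.BinOp) and type(node.op) in ops:
              return ops[type(node.op)](ev(node.left), ev(node.right))
          raise ValueError("unsupported expression")
      return ev(ast.parse(expr, mode="eval").body)

  def answer(question: str) -> str:
      if not any(ch.isdigit() for ch in question):   # 1) crude classifier stand-in
          return ask_llm(question)
      expr = ask_llm("Rewrite the arithmetic in this question as a single "
                     "expression: " + question)      # 2) extract function/operands
      result = safe_eval(expr)
      return ask_llm(question + "\nThe computed result is " + str(result)
                     + ". Use it in your answer.")   # 4) + 5) stitch result back in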


Then what's the point of ever talking about LLM capabilities again? We've already hooked them up to other tools.

This confusion was introduced at the top of the thread. If the argument is "LLMs plus tooling can't do X," the argument is wrong. If the argument is "LLMs alone can't do X," the argument is worthless. In fact, if the argument is that binary at all, it's a bad argument and we should laugh it out of the room; the idea that a lay person uninvolved with LLM research or development could make such an assertion is absurd.


It shows you when it's calling functions. I also did the same test with Llama, which runs locally and cannot make function calls, and it still works.


You are right. I actually downloaded Llama to do more detailed tests. God bless Stallman.


Minor edits to well-known problems do easily fool current models, though. Here's one 4o and o1-mini fail on, but o1-preview passes. (It's the mother/surgeon riddle, so it's kinda gore-y.)

https://chatgpt.com/share/6723477e-6e38-8000-8b7e-73a3abb652...

https://chatgpt.com/share/6723478c-1e08-8000-adda-3a378029b4...

https://chatgpt.com/share/67234772-0ebc-8000-a54a-b597be3a1f...


I think you didn't use the "share" function; I cannot open any of these links. Can you do it in a private browser session (so you're not logged in)?


Oops, fixed the links.

mini's answer is correct, but then it forgets that fathers are male in the next sentence.


At this point I really only take rigorous research papers into account when considering this stuff. Apple published research just this month that the parent post is referring to. A systematic study is far more compelling than an anecdote.

https://machinelearning.apple.com/research/gsm-symbolic


That study shows 4o, o1-mini and o1-preview's new scores are all within the margin of error on 4/5 of their new benchmarks (some even see increases). The one that isn't involves changing more than names.

Changing names does not affect the performance of SOTA models.


> That study very clearly shows 4o, o1-mini and o1-preview's new scores are all within the margin of error on 4/5 of their new benchmarks.

Which figure are you referring to? For instance, figure 8a shows a -32.0% accuracy drop when an insignificant change was added to the question. It's unclear how that's "within the margin of error" or how "changing names does not affect the performance of SOTA models".


Table 1 in the Appendix. GSM-NoOp is the one benchmark that sees significant drops for those 4 models as well (with preview dropping the least, at -17%). NoOp adds "seemingly relevant but ultimately inconsequential statements". So "change names, performance drops" is decidedly false for today's state of the art.


Thanks. I wrongly focused on the headline result of the paper rather than the specific claim in the comment chain about "changing name, different results".


Ah, that’s a good point thanks for the correction.


Only if there isn't a systemic fault, e.g. bad prompting.

Their errors appear to disappear when you correctly set the context from conversational to adversarial testing — and Apple is actually testing the social context and not the model's ability to reason.

I’m just waiting for Apple to release their GSM-NoOp dataset to validate that; preliminary testing shows it’s the case, but we’d prefer to use the same dataset so it’s an apples-to-apples comparison. (They claim it will be released “soon”.)


To be fair, the claim wasn't that it always produced the wrong answer, just that there exists circumstances where it does. A pair of examples where it was correct hardly justifies a "demonstrably false" response.


Conversely, a pair of examples where it was incorrect hardly justifies the opposite response.

If you want a more scientific answer there is this recent paper: https://machinelearning.apple.com/research/gsm-symbolic


It kind of does though, because it means you can never trust the output to be correct. The error is a much bigger deal than it being correct in a specific case.


You can never trust the outputs of humans to be correct but we find ways of verifying and correcting mistakes. The same extra layer is needed for LLMs.


> It kind of does though, because it means you can never trust the output to be correct.

Maybe some HN commenters will finally learn the value of uncertainty then.


This is the kind of comment you make when your experience with LLMs is through memes.


This is a relatively trivial task for current top models.

More challenging are unconventional story structures, like a mom named Matthew with a son named Mary and a daughter named William: who is Matthew's daughter?

But even these can still be done by the best models. And it is very unlikely there is much if any training data that's like this.


That's a neat example problem, thanks for sharing!

For anyone curious: https://chatgpt.com/share/6722d130-8ce4-800d-bf7e-c1891dfdf7...

> Based on traditional naming conventions, it seems that the names might have been switched in this scenario. However, based purely on your setup:

>

> Matthew has a daughter named William and a son named Mary.

>

> So, Matthew's daughter is William.


How do people fare on unconventional structures? I am reminded of that old riddle involving the mother being the doctor after a car crash.


No idea why you've been downvoted, because that's a relevant and true comment. A more complex example would be the Monty Hall problem [1], for which even some very intelligent people will intuitively give the wrong answer, whereas symbolic reasoning (or Monte Carlo simulations) leads to the right conclusion.

[1] https://en.wikipedia.org/wiki/Monty_Hall_problem
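For what it's worth, the Monte Carlo check really is only a few lines; a minimal Python sketch:

  import random

  def monty_hall(trials: int = 100_000):
      stay_wins = switch_wins = 0
      for _ in range(trials):
          car = random.randrange(3)
          pick = random.randrange(3)
          # Host opens a door that is neither the contestant's pick nor the car.
          opened = next(d for d in range(3) if d != pick and d != car)
          switched = next(d for d in range(3) if d != pick and d != opened)
          stay_wins += (pick == car)
          switch_wins += (switched == car)
      return stay_wins / trials, switch_wins / trials

  print(monty_hall())  # roughly (0.333, 0.667): switching wins about 2/3 of the time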


And yet, humans, our benchmark for AGI, suffer from similar problems, with our reasoning being heavily influenced by things that should have been unrelated.

https://en.m.wikipedia.org/wiki/Priming_(psychology)


The whole design of an LLM is to consume and compress a huge space of human-generated content and use that to predict how a human would reply, one token at a time. That alone means the LLM isn't modelling anything beyond the human content it was trained on, and there is no reasoning, since every prediction is based only on probabilities combined with sampling controls (randomization factors such as temperature) used to avoid an entirely deterministic output.
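To make "probabilities plus randomization controls" concrete, here is a minimal sketch of temperature sampling over next-token scores; the vocabulary and logits are made up for illustration:

  import numpy as np

  def sample_next_token(logits, temperature=0.8, rng=np.random.default_rng()):
      # Lower temperature sharpens the distribution (temperature -> 0 approaches
      # greedy, fully deterministic decoding); higher temperature flattens it.
      scaled = logits / temperature
      probs = np.exp(scaled - scaled.max())
      probs /= probs.sum()
      return rng.choice(len(logits), p=probs)

  vocab = ["the", "cat", "sat", "on", "mat"]         # made-up vocabulary
  logits = np.array([2.0, 1.1, 0.3, -0.5, -1.0])     # made-up model scores
  print(vocab[sample_next_token(logits)])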


That’s not an accurate description. Attention / multi-head attention mechanisms allow the model to understand relationships between words far apart and their context.

They still lack, as far as we know, a world model, but the results are already eerily similar to how most humans seem to think - a lot of our own behaviour can be described as “predict how another human would reply”.
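For readers who want the mechanism made concrete, here is a minimal single-head scaled dot-product attention sketch in NumPy; real models add learned projections, multiple heads, masking, and many stacked layers:

  import numpy as np

  def scaled_dot_product_attention(Q, K, V):
      # Every query position attends to every key position, however far apart.
      scores = Q @ K.T / np.sqrt(K.shape[-1])             # (seq, seq) similarities
      weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
      weights /= weights.sum(axis=-1, keepdims=True)      # softmax over positions
      return weights @ V                                  # context-mixed values

  seq_len, dim = 6, 8
  x = np.random.randn(seq_len, dim)                       # toy token embeddings
  out = scaled_dot_product_attention(x, x, x)
  print(out.shape)  # (6, 8): each position now blends information from all others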


When trained on simple logs of Othello moves, the model learns an internal representation of the board and its pieces. It also models the strength of its opponent.

https://arxiv.org/abs/2210.13382

I'd be more surprised if LLMs trained on human conversations don't create any world models. Having a world model simply allows the LLM to become better at sequence prediction. No magic needed.

There was another recent paper showing that a language model models things like the age, gender, etc. of its conversation partner without having been explicitly trained for it.


Do we know for a fact that the mechanisms are actually used that way inside the model?

My understanding was that they know how the model was designed to be able to work, but that there's been very little (no?) progress on the black-box problem, so we really don't know much at all about what actually happens internally.

Without a better understanding of what actually happens when an LLM generates an answer, I stick with the most basic answer: it's simply predicting what a human would say. I could be wildly misinformed there; I don't work directly in the space and it's been moving faster than I'm interested in keeping up with.


For a lot of the content they were trained on, it seems like the easiest way to predict the next token would be to model the world or work with axioms. So how do we know that an LLM isn't doing these things internally?


In fact, it looks like the model is doing those things internally.

  We argue that representations in AI models, particularly deep networks, are converging. First, we survey many examples of convergence in the literature: over time and across multiple domains, the ways by which different neural networks represent data are becoming more aligned. Next, we demonstrate convergence across data modalities: as vision models and language models get larger, they measure distance between datapoints in a more and more alike way. We hypothesize that this convergence is driving toward a shared statistical model of reality, akin to Plato’s concept of an ideal reality. We term such a representation the platonic representation and discuss several possible selective pressures toward it. Finally, we discuss the implications of these trends, their limitations, and counterexamples to our analysis.
https://arxiv.org/html/2405.07987v5


Unless I misread this paper, their argument is entirely hypothetical, meaning that the LLM is still a black box and they can only hypothesise about what is going on internally by viewing the output(s) and guessing at what it would take to get there.

There's nothing wrong with a hypothesis or that process, but it means we still don't know whether models are doing this or not.


Maybe I mixed up that paper with another but the one I meant to post shows that you can read something like a world model from the activations of the layers.

There was a paper showing that a model trained on Othello moves creates a model of the board, models the skill level of its opponent, and more.
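The technique in that line of work is essentially a linear probe: freeze the model, record hidden activations, and train a small classifier to predict board state (or speaker attributes) from them. A rough sketch with scikit-learn, using made-up placeholder arrays rather than a real Othello model:

  import numpy as np
  from sklearn.linear_model import LogisticRegression

  # Placeholder data: in the real experiments, `acts` would be hidden-layer
  # activations recorded while the frozen model reads move sequences, and
  # `labels` the true contents of one board square at that point in the game.
  rng = np.random.default_rng(0)
  acts = rng.standard_normal((5000, 512))      # (examples, hidden_dim)
  labels = rng.integers(0, 3, size=5000)       # 0 = empty, 1 = black, 2 = white

  probe = LogisticRegression(max_iter=1000)
  probe.fit(acts[:4000], labels[:4000])
  print("probe accuracy:", probe.score(acts[4000:], labels[4000:]))
  # With random placeholders this stays near chance (~0.33); the reported finding
  # is that probes on real activations recover the board far better than chance.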


Well my understanding is that there's ultimately the black box problem. We keep building these models and the output seems to get better, but we can't actually inspect how they work internally.


How do we know Santa doesn't exist? Maybe he does.


If you expect "the right way" to be something _other_ than a system which can generate a reasonable "state + 1" from a "state" - then what exactly do you imagine that entails?

That's how we think. We think sequentially. As I'm writing this, I'm deciding the next few words to type based on my last few.

Blows my mind that people don't see the parallels to human thought. Our thoughts don't arrive fully formed as a god-given answer. We're constantly deciding the next thing to think, the next word to say, the next thing to focus on. Yes, it's statistical. Yes, it's based on our existing neural weights. Why are you so much more dismissive of that when it's in silicon?


Because we still don't know how the brain really does all it does in very specific terms, so why assume to know exactly how we think?


Why is there only one valid way of producing thoughts?


Finite-state machines are a limited model. In principle, you can use them to model everything that can fit in the observable universe. But that doesn't mean they are a good model for most purposes.

The biggest limitation with the current LLMs is the artificial separation between training and inference. Once deployed, they are eternally stuck in the same moment, always reacting but incapable of learning. At best, they are snapshots of a general intelligence.

I also have a vague feeling that a fixed set of tokens is a performance hack that ultimately limits the generality of LLMs: hardcoded assumptions make tasks that build on those assumptions easier, and seeing past the assumptions harder.


> At best, they are snapshots of a general intelligence.

So are we, at any given moment.


> As I'm writing this, I'm deciding the next few words to type based on my last few.

If that were all there is to it, you could have written this as a newborn baby; in reality, you are determining these words based on a lifetime of experience. LLMs don't do that: every instance of ChatGPT is the same newborn baby, while a thousand clones of you could all be vastly different.




I totally agree that they’re a local maximum and they don’t seem like a path to AGI. But they’re definitely kinda reasoning systems, in the sense that they can somewhat reason about things. The whacky process they use to get there doesn’t take away from that IMO


> I've been saying it since they started popping off last year and everyone was getting euphoric about them.

Remember the resounding euphoria at the LK-99 paper last year, and how everyone suddenly became an expert on superconductors? It's clear that we've collectively learned nothing from that fiasco.

The idea of progress itself has turned into a religious cult, and what's worse, "progress" here is defined to mean "whatever we read about in 1950s science fiction".


> It continues to astound me how much money is being dumped into these things.

Maybe in our society there's a surprising amount of value in a "word stirrer" intelligence. Sure, if it were confident when it was right and hesitant when it was wrong, it'd be much better. Maybe humans are confidently wrong often enough that an artificial version with compendious experience to draw on is groundbreaking.


I am pretty sure Claude 3.5 Sonnet can reason, or did reason, with a particular snippet of code I was working on. I am not an expert in this area, but my guess is that these neural nets (made for language prediction) are being used for reasoning. That is not their optimal behavior, though (since they are token predictors). A big jump in reasoning will happen when reasoning is offloaded to an LRM.

Human brains are certainly big, but they are inefficient because a big portion of the brain goes to non-intelligence stuff like running the body's internal organs, vision, etc.

I do agree that the money is not well spent. They should have recognized that we are hitting a local maximum with the current models, and funding should be going to academic/theoretical work instead of dumb brute force.


> So, LLMs face a regression on their latest proposed improvement.

Arguably a second regression, the first being cost, because CoT improves performance by scaling up the amount of compute used at inference time instead of training time. The promise of LLMs was that you do expensive training once and then run the model cheaply forever, but now we're talking about expensive training followed by expensive inference every time you run the model.


To be fair, they have also advanced on the cost front with other models:

GPT-4o and GPT-4o mini have a tenth and a hundredth of the inference cost of GPT-4, respectively.


> So, LLMs face a regression on their latest proposed improvement.

A regression that humans also face, and we don't say therefore that it is impossible to improve human performance by having them think longer or work together in groups, we say that there are pitfalls. This is a paper saying that LLMs don't exhibit superhuman performance.


LLMs are a local maximum in the same way that ball bearings can't fly. LLM-like engines will almost certainly be components of an eventual AGI-level machine.


What is your "almost certainty" based on? What does it even mean? Every thread on LLMs is full of people insisting their beliefs are certainties.

What I'm certain of is that we should not praise the inventor of ball bearings for inventing flight, nor claim that once ball bearings were invented, flight became inevitable and only a matter of time.


I say 'almost certainly' because LLMs are basically just a way to break down language into its component ideas. Any AGI-level machine will most certainly be capable of swapping semantic 'interfaces' at will, and something like an LLM is a very convenient way to encode that interface.


I don't think that's necessarily true; that presumes the cobbled-together assortment of machine learning algorithms we have now will somehow get to AGI. If we need a fundamentally different way of doing things, there's no reason to assume it will use a language model at all.


I agree, my bet is that they will be used for NLP, and ML debugging/analysis.




