Note that for the purposes of this paper a “problem” just means a formally decidable problem or a formal language, and the proof is that by creatively arranging transformers you can make individual transformer runs behave like individual Boolean circuits. However, this is a long way from any practical application of transformers: for one thing, most problems we care about are not stated as formal languages, and we already have a vastly more efficient way to implement Boolean circuits.
If a "problem we care about" is not stated as a formal language, does it mean it does not exist in the hierarchy of formal languages? Or is it just as yet unclassified?
It means that there are two problems: one, formalizing the problem as stated while capturing all relevant details, and two, solving the resulting formal problem. Until you solve problem one, you can't use formal methods to say anything about the problem (a priori, it's not even clear that the problem is solvable).
Unfortunately, the task of formalizing an informal problem is itself an informal problem that we don't know how to formalize, so we can't say much about it. So overall, we can't say much about how hard the general problem "given a problem statement from a human, solve that problem" is, whether any particular system (including a human!) can solve it, or how long that might take and with what resources.
No, but it's pretty obvious, isn't it? If you have an informal problem statement, say "I want this button to be bigger", formalizing it can't be a formal process.
> "I want this button to be bigger", formalizing it can't be a formal process.
    while (!is_button_big_enough()) {
        button.scaleUp(1.1);
    }
This is one trivial way to do it, and seems like it would be formalizable. is_button_big_enough is simply an input to whatever process is responsible for judging such a thing, whether that be a design specification or perhaps input from a person.
You've translated my informal problem statement into a quasi-formal process, using your inherent natural language processing skills, and your knowledge of general human concepts like size. But you haven't explained the formal process you followed to go from my problem statement to this pseudocode.
And your pseudocode template only works for one particular kind of informal problem statement. If I instead have the problem "how much money do I need to buy this house and this chair?", or "does this bite fit in my mouth?", your general form will not work.
And what's more, you haven't actually produced a formally solvable problem definition, that we could analyze for complexity and computability, because you rely on two completely unspecified functions. Where is the formal definition of a button? Is it a physical push button or a UI control or a clothing button? What does it mean that it is bigger or smaller? When do we know it's big enough, and is that computable? And how do we scale it up? Do we increase its volume? Its surface area? One of its sides? Or maybe the radius? And how do we go about doing that? All of these, and many more, need to be explicitly defined in order to apply any kind of formal analysis to this problem. And there is no formal way to do so in a way that matches the intent of whoever posed the problem.
> And what's more, you haven't actually produced a formally solvable problem definition, that we could analyze for complexity and computability, because you rely on two completely unspecified functions. Where is the formal definition of a button?
Well, your statement was underspecified. You said "I want this button bigger". There are procedures to translate informal statements to formal ones, but one basic step is that underspecified referents are delegated to abstractions that encapsulate those details. So "this button" designates some kind of model of a button, and "I" refers to a subject outside the system, thereby implying some kind of interactive process to query the subject whether the model is satisfactory, e.g. a dialog prompt asking "Is this button big enough now?"
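To make that concrete, here is a minimal sketch of the kind of abstraction I mean. Button, its dimensions, and the 1.1 scale factor are hypothetical choices on my part, not something derivable from the original statement; the point is just that the vague "I" becomes an interactive query back to the person who posed the problem.

    class Button:
        def __init__(self, width, height):
            self.width, self.height = width, height

        def scale_up(self, factor):
            self.width *= factor
            self.height *= factor

    def is_button_big_enough(button):
        # Delegate the judgment to the subject outside the system.
        answer = input(f"Is a {button.width:.0f}x{button.height:.0f} button big enough now? [y/n] ")
        return answer.strip().lower().startswith("y")

    button = Button(width=80, height=24)
    while not is_button_big_enough(button):
        button.scale_up(1.1)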
You call these skills "inherent", but humans are not magical. We employ bug-riddled, poorly specified procedures for doing this kind of interpretive work, and LLMs have already started to do this too, and they'll only get better. Is asking a deterministic LLM to generate a formal specification or program to achieve some result a formal process? I don't think these lines are as clear as many think, not anymore.
I think we're mostly agreed actually. I'm not trying to claim that this is an unsolvable problem, just that it's a difficult problem that we don't have a solution for yet. And yes, LLMs are probably our best tool so far. And asking for clarifying questions is clearly a part of the process.
I will say that there is also a possibility the general form of the formal problem is in fact uncomputable. It seems possible to me it might be related to the halting problem. But, until we have a formal specification of it, we won't know, of course.
There are procedures for translating informal statements to formal ones. If I submit such informal statements to an LLM and ask it to generate a spec or program to achieve some result, that can be made repeatable. There are various arrangements to make this more robust, like having another LLM generate test cases to check the work of the other. Does this qualify?
It's... "knee-jerk obvious". But is it actually true? People seem to be interested in the concept in formal logic arguments, for example https://www.researchgate.net/publication/346658578_How_to_Fo... (which uses a formal process for part of the formalization), so maybe it's not as simple as it seems initially. I mean, if we're already talking about formal problems, it could use a stronger proof ;)
At best, this is a formal process for manipulating certain kinds of statements. But the general problem, "take a human's statement of a problem and translate it into a formal statement of a problem that, if solved, will address what the human was asking for" is far harder and more nebulous. Ultimately, it's exactly the problem that LLMs have been invented for, so it has been studied in that sense (and there is a broad literature in AI for NLP, algorithm finding, expert systems, etc). But no one would claim that they are even close to having a formal specification of this problem that they could analyze the complexity of.
I'm not saying that the statement "I want this button to be bigger" can't be formalized. I'm saying that there is no formal process you can follow to get from this problem to a formal problem that is equivalent. There isn't even a formal process you can use to check if a formal definition is equivalent to this problem.
Consider that if someone asked you to solve this problem for them with just this statement, either of the following could be a sketch of a more formal statement of what they actually want:
1. In a given web page, the css class used for a particular <button> element should be changed to make the button's height larger by 10%, without changing any other <button> element on the page, or any other dimension.
2. For a particular garment that you are given, the topmost button must be replaced with a different button that appears to have the same color and finish to a human eye, and that has the same 3D shape up to human observational precision, but that has a radius large enough to not slip through the opposing hole under certain forces that are commonly encountered, but not so large that it doesn't fit in the hole when pushed with certain forces that are comfortable for humans.
I think you would agree that (a) someone who intended you to solve either of these problems might reasonably describe them with the statement I suggested, and (b), that it would be very hard to devise a formal mathematical process to go from that statement to exactly one of these statements.
Ah, gotcha. I agree it would be difficult. I’m still not convinced it would be impossible though.
LLMs could even formalise what you want in the context, even now.
Or do you mean that you can’t formalise every statement when given incomplete information about the context of the statement, since then we have a single word pointing to multiple different contexts?
Ah, you are informally inquiring about a formal description concerning the informal nature of formalization of informal questions.
Joke aside, this is about the nature of the formalization process itself. If the process of formalizing informal problems were fully formalized, it would be possible to algorithmically compute the solution and even optimize it mathematically. However, since this is obviously impossible (e.g. vague human language), it suggests that the formalization process can't be fully formalized.
My 2 cents: Since LLMs (Large Language Models) operate as at least a subset of Turing machines (which recognize recursively enumerable languages), the chain-of-thought (CoT) approach could be equivalent to, or even more expressive than, that subset. In fact, CoT could well amount to a full Turing machine.
If we leave CoT aside for a moment, it's worth exploring the work discussed in the paper "Neural Networks and the Chomsky Hierarchy"[1], which analyzes how neural networks (including LLMs) map onto different levels of the Chomsky hierarchy, with a particular focus on their ability to recognize formal languages across varying complexity.
How would that be remarkable, when it is exactly what the Universal Approximation Theorem already states? Since transformers also use fully connected layers, none of this should really come as a surprise. But from glancing at the paper, they don't even mention it.
It's 'remarkable' because (a) academic careers are as much about hype as science, (b) arxiv doesn't have peer review process to quash this, (c) people take arxiv seriously.
>How would that be remarkable, when it is exactly what the Universal Approximation Theorem already states
Only with infinite precision, which is highly unrealistic. Under realistic assumptions, fixed-depth transformers without chain-of-thought are very limited in what they can express: https://arxiv.org/abs/2207.00729 . Chain of thought increases the class of problems which fixed-depth transformers can solve: https://arxiv.org/abs/2310.07923
I'm waiting for the people of AI to discover syllogism and inference in its original PROLOG sense, which this CoT abomination basically tries to achieve. Interestingly, if all logical content were translated to rules, and then only the rules were fed into the LLM training set, what would the result be? And can the probabilistic magic be made into actually following reason, without all the dice?
Right, we’ve now gotten to the stage of this AI cycle where we start using the new tool to solve problems old tools could solve. Saying a transformer can solve any formally decidable problem if given enough tape isn’t saying much. It’s a cool proof, I don’t mean to deny that, but it doesn’t mean much practically, as we already have more efficient tools that can do the same.
What I don't get is... didn't people prove that in the 90s for any multi-layer neural network? Didn't people prove transformers are equivalent in the transformers paper?
Yes they did. A two layer network with enough units in the hidden layer can form any mapping to any desired accuracy.
And a two layer network with single-delay feedback from the hidden units to themselves can capture any dynamic behavior (to any desired accuracy).
Adding layers and more structured architectures creates the opportunity for more efficient training and inference, but doesn't enable any new potential behavior. (Except in the sense that reducing resource requirements can allow impractical problems to become practical.)
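For anyone who wants to see the first claim in action, here is a toy illustration of my own (just a quick numpy sketch, not from the 90s papers): a single hidden tanh layer with randomly fixed input weights, where only the output weights are fit by least squares, already approximates sin(x) closely on an interval, and the fit improves as you add hidden units.

    import numpy as np

    rng = np.random.default_rng(0)
    x = np.linspace(-np.pi, np.pi, 500)[:, None]
    y = np.sin(x)

    n_hidden = 200
    W = rng.normal(scale=2.0, size=(1, n_hidden))   # fixed random input-to-hidden weights
    b = rng.normal(scale=2.0, size=n_hidden)        # fixed random hidden biases
    H = np.tanh(x @ W + b)                          # hidden-layer activations

    # Fit only the hidden-to-output weights, by least squares (no backprop
    # needed for this illustration).
    V, *_ = np.linalg.lstsq(H, y, rcond=None)
    print("max abs error:", np.max(np.abs(H @ V - y)))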
Putting down a 50-buck bet that some very smart kid in the near future will come up with an entropy-meets-graphical-structures theorem which gives an estimate of how the loss of information is affected by the size and type of the underlying structure holding this information.
It took a while for people to actually start talking about LZW as a grammar algorithm, not a "dictionary"-based algorithm, which is then reasoned about in a more general sense again by https://en.wikipedia.org/wiki/Sequitur_algorithm.
This is not to say that LLMs are not cool; we put them to use every day. But the reasoning part is never going to be trustworthy without a 100% discrete system, which can infer the syllogistic chain with zero doubt and 100% traceable origin.
I was thinking about the GraphRAG paper and Prolog. I’d like to extract predicates. The source material will be inconsistent and contradictory and incomplete.
Using the clustering (community) model, an LLM can summarize the opinions as a set of predicates which don’t have to agree, plus some general weight of how much people agree or disagree with them.
The predicates won’t be suitable for symbolic logic because the language will be loose. However an embedding model may be able to connect different symbols together.
Then you could attempt multiple runs through the database of predicates because there will be different opinions.
Then one could attempt to reason using these loosely stitched predicates. I don’t know how good the outcome would be.
I imagine this would be better in an interactive decision making tool where a human is evaluating the suggestions for the next step.
This could be better for planning than problem solving.
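A rough sketch of what I have in mind, under the assumption that an LLM call (the llm_extract_predicates helper below is hypothetical) produces loose predicate strings per community, and that an off-the-shelf embedding model, e.g. sentence-transformers, is used to stitch near-duplicate predicates together:

    from sentence_transformers import SentenceTransformer

    def llm_extract_predicates(community_docs):
        """Hypothetical helper: ask an LLM to distill a community's documents
        into loosely worded predicate strings plus an agreement weight."""
        raise NotImplementedError

    model = SentenceTransformer("all-MiniLM-L6-v2")

    def link_predicates(predicates, threshold=0.8):
        # Connect predicates whose embeddings are close enough, so that
        # "X lowers cost" and "X reduces expenses" can be treated as one symbol.
        vecs = model.encode(predicates, normalize_embeddings=True)
        sims = vecs @ vecs.T
        return [(predicates[i], predicates[j])
                for i in range(len(predicates))
                for j in range(i + 1, len(predicates))
                if sims[i][j] >= threshold]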
Hm... a RAG over a DB of logical rules actually may be interesting. But loosely stitched predicates you can easily put to work with some random dice when you decide on the inference.
So how about we start thinking of AI as a combination of the graphical probabilistic whatever, which compresses the information from the training set in a very lossy manner, and which is then hooked up, internally or externally, to a discrete logical core whenever CoT is needed. This construct can then benefit from both worlds.
I’m surprised that understanding how a thought unfolds is being considered not relevant to the answer. I have done a lot of problem solving in groups and alone. How thoughts develop seems fundamental to understanding the solutions.
The story regarding the banning of terms that can be used with a reasoning system is a big red flag to me.
This sort of knee jerk reaction displays immature management and an immature technology product.
A little late to reply, but perhaps you'll see this. Does it not strike you that lots of these articles on AI that get published are very childish? Not in the math sense, but in the reasoning sense. Besides, most of them are anything but interdisciplinary. I've almost never encountered prompt engineers who actually tried to delve into what GPTs do, and then these CoT guys don't know a thing about predicate logic, yet try to invent it anew.
On your comment regarding banning tokens/terms, we are on the same page. We can agree all of this is very immature, and many of the people too, including this lot of Chinese kids who seem to put out one paper per hour. You see, the original seq2seq paper is 8 pages, topic included. Can you imagine? But Sutskever was not a child back then; he was already deep into all of this. We can easily state/assume the LLM business is in its infancy. It may easily stay there for a century until everyone levels up.
But didn't we already know that NNs can solve any computable problem? The interesting thing is whether they can be trained to solve any (computable) problem.
Does that mean that when we reduce the precision of an NN, for example using bfloat16 instead of float32, we reduce the set of computational problems that can be solved?
How would that compare with a biological neural network with presumably near-infinite precision?
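Just to make the precision question concrete, here's a tiny PyTorch check (my own illustration, not from the paper): bfloat16 keeps float32's exponent range but only about 3 decimal digits of mantissa, so nearby values collapse to the same number.

    import torch

    x = torch.tensor([1.0001, 1.0002, 1.0003])
    print(x.to(torch.bfloat16).to(torch.float32))  # all three round to 1.0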
On the first day of our introduction to NNs we were asked to create all the logic gates using artificial neurons, and were then told "If you have all gates, you can do all computations".
I've got to admit, I'm sorta sticking to that at face value, because I don't know enough computer science to a) discern if that is true and b) know what "f: X -> Y only for closed domains" means.
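For what it's worth, the classroom exercise is small enough to show inline. This is my own toy version, not anyone's official implementation: a single threshold neuron with hand-picked weights computes NAND, and since NAND is universal, every other gate (and hence any Boolean circuit) is a composition of it.

    def neuron(inputs, weights, bias):
        # Classic threshold unit: fire iff the weighted sum plus bias is positive.
        return 1 if sum(w * x for w, x in zip(weights, inputs)) + bias > 0 else 0

    def nand(a, b):
        return neuron([a, b], weights=[-2, -2], bias=3)

    def and_gate(a, b):
        # NAND is universal, e.g. AND(a, b) = NAND(NAND(a, b), NAND(a, b)).
        n = nand(a, b)
        return nand(n, n)

    for a in (0, 1):
        for b in (0, 1):
            print(a, b, "NAND:", nand(a, b), "AND:", and_gate(a, b))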
Only NNs of infinite size or precision. Under more realistic assumptions, transformers without chain of thought are actually limited in what they can solve: https://arxiv.org/abs/2207.00729
"What is the performance limit when scaling LLM inference? Sky's the limit.
We have mathematically proven that transformers can solve any problem, provided they are allowed to generate as many intermediate reasoning tokens as needed. Remarkably, constant depth is sufficient."
The universal approximation theorem is good to know because it says there's no theoretical upper bound to a function-approximating NN's accuracy. In practice it says nothing about what can be realistically achieved, though.
A key difference is that the way LMMs (Large Multimodal Models) generate output is far from random. These models can imitate or blend existing information, or imitate (and probably blend) known reasoning methods in the training data. The latter is a key distinguishing feature of the new OpenAI o1 models.
Thus, the signal-to-noise ratio of their output is generally way better than infinite monkeys.
Arguably, humans rely on similar modes of "thinking" most of the time as well.
Yeah. Monkeys. Monkeys that write useful C and Python code that needs a bit less revision every time there's a model update.
Can we just give the "stochastic parrot" and "monkeys with typewriters" schtick a rest? It made for novel commentary three or four years ago, but at this point, these posts themselves read like the work of parrots. They are no longer interesting, insightful, or (for that matter) true.
If you think about it, humans necessarily use abstractions, from the edge detectors in retina to concepts like democracy. But do we really understand? All abstractions leak, and nobody knows the whole stack. For all the poorly grasped abstractions we are using, we are also just parroting. How many times are we doing things because "that is how they are done" never wondering why?
Take ML itself, people are saying it's little more than alchemy (stir the pile). Are we just parroting approaches that have worked in practice without real understanding? Is it possible to have centralized understanding, even in principle, or is all understanding distributed among us? My conclusion is that we have a patchwork of partial understanding, stitched together functionally by abstractions. When I go to the doctor, I don't study medicine first, I trust the doctor. Trust takes the place of genuine understanding.
So humans, like AI, use distributed and functional understanding, we don't have genuine understanding as meant by philosophers like Searle in the Chinese Room. No single neuron in the brain understands anything, but together they do. Similarly, no single human understands genuinely, but society together manages to function. There is no homunculus, no centralized understander anywhere. We humans are also stochastic parrots of abstractions we don't really grok to the full extent.
Great points. We're pattern-matching shortcut machines, without a doubt. In most contexts, not even good ones.
> When I go to the doctor, I don't study medicine first, I trust the doctor. Trust takes the place of genuine understanding.
The ultimate abstraction! Trust is highly irrational by definition. But we do it all day every day, lest we be classified as psychologically unfit for society. Which is to say, mental health is predicated on a not-insignificant amount of rationalizations and self-deceptions. Hallucinations, even.
That's just it. We're not unique. We've always been animals running on instinct in reaction to our environment. Our instincts are more complex than other animals but they are not special and they are replicable.
The infinite monkey post was in response to this claim, which, like the universal approximation theorem, is useless in practice:
"We have mathematically proven that transformers can solve any problem, provided they are allowed to generate as many intermediate reasoning tokens as needed. Remarkably, constant depth is sufficient."
Like an LLM, you omit the context and browbeat people with the "truth" you want to propagate. Together with the many politically forbidden terms since 2020, let us now also ban "stochastic parrot" in order to have a goodbellyfeel newspeak.
There is also the problem of "stochastic parrot" being constantly used in a pejorative sense, as opposed to a neutral term that keeps us grounded and skeptical.
Of course, it is an overly broad stroke that doesn't quite capture all the nuance of the model but the alternative of "come on guys, just admit the model is thinking" is much worse and has much less to do with reality.
ChatGPT was released in November 2022. That's one year and 10 months ago. Their marketing started in the summer of the same year, still far off from 3-4 years.
But ChatGPT wasn't the first; OpenAI had a coding playground with GPT-2, and you could already code with it even before that, around 2020, so I'd say it has been 3-4 years
GPT-3 paper announcement got 200 comments on HN back in 2020.
It doesn't matter when marketing started, people were already discussing it in 2019-2020.
Stochastic parrot: The term was coined by Emily M. Bender[2][3] in the 2021 artificial intelligence research paper "On the Dangers of Stochastic Parrots: Can Language Models Be Too Big? " by Bender, Timnit Gebru, Angelina McMillan-Major, and Margaret Mitchell.[4]
> Tired arguments from pro-IP / copyright sympathizers
You forgot "Tired ClosedAI joke from anti-IP / copyleft sympathizers".
Remember that the training data debate is orthogonal to the broader debate over copyright ownership and scope. The first people to start complaining about stolen training data were the Free Software people, who wanted a legal hook to compel OpenAI and GitHub to publish model weights sourced from GPL code. Freelance artists took that complaint and ran with it. And while this is technically an argument that rests on copyright for legitimacy, the people who actually own most of the copyrights - publishers - are strangely interested in these machines that steal vast amounts of their work.
These papers become increasingly difficult to properly comprehend.
Feed it to ChatGPT and ask for an explanation suited to your current level of understanding (5-year-old, high-school, undergrad, comp-sci grad student, and so on.)
No, really, I've tried it and it's okay for a crow's flight over these papers, but I'd never put my trust in random() to fetch me precisely what I'm looking for.
My daily usage of ChatGPT, Claude, etc. for nearly 2 years now shows one and the same thing: unless I provide enough of the right context for it to get the job done, the job is never done right. Ever. Accidentally maybe, but never ever. And this becomes particularly evident with larger documents.
The pure RAG-based approach is a no-go; you cannot be sure important stuff is not omitted. The "feed the document into the context" approach still, by definition, will not work correctly, thanks to all the bias accumulated in the LLM's layers.
So it is a way to approach papers if you really know what they contain and know the surrounding terminology. But it is really a no-go if you are reading about... complex analysis and know nothing about algebra of the 5th degree. Sorry, this is not gonna work, and will probably take longer in total time/energy on the reader's part.
>* Claiming it's predictive text engine that isn't useful for anything
This one is very common on HN and it's baffling. Even if it is predictive text, who the hell cares, if it achieves its goals? If an LLM is actually a bunch of dolphins typing on a keyboard made for dolphins, I couldn't care less, as long as it does what I need it to do. For people who continue to repeat this on HN: why? I just want to know, out of curiosity.
>* AI will never be able to [some pretty achievable task]
Also very common on HN.
You forgot the "AI will never be able to do what a human can do in the exact way a human does it so AI will never achieve x".
> Even if it's predictive text, who the hell cares if it achieves its goals?
Haha ... well in the literal sense it does achieve "its" goals, since it only had one goal which was to minimize its training loss. Mission accomplished!
OTOH, if you mean achieving the user's goals, then it rather depends on what those goals are. If the goal is to save you typing when coding, even if you need to check it all yourself anyway, then I guess mission accomplished there too!
Partly because "artificial intelligence" is a loaded phrase which brings implications of AGI along for the ride, partly because "intelligence" is not a well defined term, so an artificial version of it could be argued to be almost anything, and partly because even if you lean on the colloquial understanding of what "intelligence" is, ChatGPT (and its friends) still isn't it. It's a Chinese Room - or a stochastic parrot.
It's not incredulity, just pointing out the obvious. Searle placed very specific limitations on the operator of the Room. He rests his whole argument on the premise that the operator is illiterate in Chinese, or at least has no access to the semantics of the material stored in the Room. That's plainly not the case with ChatGPT, or it couldn't review its previous answers to find and fix its mistakes.
And you certainly would not get a different response, much less a better one, from the operator of a Chinese Room simply by adding "Think carefully step by step" to the request you hand him.
It's just a vacuous argument from square one, and it annoys me to an entirely-unreasonable extent every time someone brings it up. Add it to my "Stochastic Parrot" and "Infinite Monkeys" trigger phrases, I guess.
> ... He rests his whole argument on the premise that the operator is illiterate in Chinese, or at least has no access to the semantics of the material stored in the Room.
> That's plainly not the case with ChatGPT, or it couldn't review its previous answers to find and fix its mistakes.
Which is another way of saying, ChatGPT couldn't produce semantically correct output without understanding the input. Disagreeing with which is the whole point of the Chinese Room argument.
Why cannot the semantic understanding be implicitly encoded in the model? That is, why cannot the program I (as the Chinese Room automaton) am following be of sufficient complexity that my output appears to be that of an intelligent being with semantic understanding and the ability to review my answers? That, in my understanding, is where the genius of ChatGPT lies - it's a masterpiece of preprocessing and information encoding. I don't think it needs to be anything else to achieve the results it achieves.
A different example of this is the work of Yusuke Endoh, whom you may know for his famous quines. https://esoteric.codes/blog/the-128-language-quine-relay is to me one of the most astonishing feats of software engineering I've ever seen, and little short of magic - but at its heart it's 'just' very clever encoding. Each program in the chain understands nothing and yet encodes every subsequent program, including itself. Another example is DNA; how on Earth does a dumb molecule create a body plan? I'm sure there are lots of examples of systems that exhibit such apparently intelligent and subtly discriminative behaviour entirely automatically. Ant colonies!
> And you certainly would not get a different response, much less a better one, from the operator of a Chinese Room simply by adding "Think carefully step by step" to the request you hand him.
Again, why not? It has access to everything that has gone before; the next token is f(all the previous ones). As for asking it to "think carefully", would you feel differently if the magic phrase was "octopus lemon wheat door handle"? Because it doesn't matter what the words mean to a human - it's just responding to the symbols it's been fed; the fact that you type something meaningful to you just obscures that fact and lends subconscious credence to the idea that it understands you.
> It's just a vacuous argument from square one,
and it annoys me to an entirely-unreasonable extent every time someone brings it up. Add it to my "Stochastic Parrot" and "Infinite Monkeys" trigger phrases, I guess.
With no intent to annoy, I hope you at least understand where I'm coming from, and why I think those labels are not just apt, but useful ways to dispel the magical thinking that some (not you specifically) exhibit when discussing these things. We're engineers and scientists and although it's fine to dream, I think it's also fine to continue trying to shoot down the balloons that we send up, so we're not blinded by the miracle of flight.
> Why cannot the semantic understanding be implicitly encoded in the model?
That just turns the question into "OK, so what distinguishes the model from a machine capable of genuine understanding and reasoning, then?"
At some point you (and Searle) must explain what the difference is in engineering terms, not through analogy or by appeals to ensoulment or by redecorating the Chinese Room with furnishings it wasn't originally equipped with. Having moved the goalpost back to the far corner of the parking garage already, what's your next move?
It's easy to dismiss a "stochastic parrot" by saying that "The next token is a function of all of the previous ones," but welcome to our deterministic universe, I guess... deterministic, that is, apart from the randomness imparted by SGD or thermal noise or what-have-you. Again, how is this different from what human brains do? Von Neumann himself naturally assumed that stored-program machines would be modeled on networks of neuron-like structures (a factoid I just ran across while reading about McCulloch and Pitts), so it's not that surprising that we're finally catching up to his way of looking at it.
At the end of the day we're all just bags of meat trying to minimize our own loss functions. There's nothing special about what we're doing. The magical thinking you're referring to is being done by those who claim "AI isn't doing X" or "AI will never do X" without bothering to define X clearly.
> I don't think it needs to be anything else to achieve the results it achieves.
Exactly, and that's earth-shaking because of the potential it has to illuminate the connection between brains and minds. It's sad that the discussion inevitably devolves into analogies to monkeys and parrots.
> That just turns the question into "OK, so what distinguishes the model from a machine capable of genuine understanding and reasoning, then?"
And that's a great question which is not far away from asking for definitions of intelligence and consciousness, which of course I don't have, however I could venture some suggestions about what we have that LLMs don't, in no particular order:
- Self-direction: we are goal-oriented creatures that will think and act without any specific outside stimulus
- Intentionality: related to the above - we can set specific goals and then orient our efforts to achieve them, sometimes across decades
- Introspection: without guidance, we can choose to reconsider our thoughts and actions, and update our own 'models' by deliberately learning new facts and skills - we can recognise or be given to understand when we're wrong about something, and can take steps to fix that (or choose to double down on it)
- Long term episodic memory: we can recall specific facts and events with varying levels of precision, and correlate those memories with our current experiences to inform our actions
- Physicality: we are not just brains in skulls, but flooded with all manner of chemicals that we synthesise to drive our biological functions, and which affect our decision making processes; we are also embedded in the real physical world and receiving huge amounts of sensory data almost constantly
> At some point you (and Searle) must explain what the difference is in engineering terms, not through analogy or by appeals to ensoulment or by redecorating the Chinese Room with furnishings it wasn't originally equipped with. Having moved the goalpost back to the far corner of the parking garage already, what's your next move?
While I think that's a fair comment, I have to push back a bit and say that if I could give you a satisfying answer to that, then I may well be defining intelligence or consciousness and as far as I know there are no accepted definitions for those things. One theory I like is Douglas Hofstadter's strange loop - the idea of a mind thinking about thinking about thinking about itself, thus making introspection a primary pillar of 'higher mental functions'. I don't see any evidence of LLMs doing that, nor any need to invoke it.
> It's easy to dismiss a "stochastic parrot" by saying that "The next token is a function of all of the previous ones," but welcome to our deterministic universe, I guess... deterministic, that is, apart from the randomness imparted by SGD or thermal noise or what-have-you. Again, how is this different from what human brains do?
...and now we're onto the existence or not of free will... Perhaps it's the difference between automatic actions and conscious choices? My feeling is that LLMs deliberately or accidentally model a key component of our minds, the faculty of pattern matching and recall, and I can well imagine that in some future time we will integrate an LLM into a wider framework that includes other abilities that I listed above, such as long term memory, and then we may yet see AGI. Side note that I'm very happy to accept the idea that each of us encodes our own parrot.
> Von Neumann himself naturally assumed that stored-program machines would be modeled on networks of neuron-like structures (a factoid I just ran across while reading about McCullough and Pitts), so it's not that surprising that we're finally catching up to his way of looking at it.
Well OK but very smart people in the past thought all kinds of things that didn't pan out, so I'm not really sure that helps us much.
> At the end of the day we're all just bags of meat trying to minimize our own loss functions. There's nothing special about what we're doing. The magical thinking you're referring to is being done by those who claim "AI isn't doing X" or "AI will never do X" without bothering to define X clearly.
I don't see how that's magical thinking, it's more like... hard-nosed determinism? I'm interested in the bare minimum necessary to explain the phenomena on display, and expressing those phenomena in straightforward terms to keep the discussion grounded. "AI isn't doing X" is a response to those saying that AI is doing X, so it's as much on those people to define what X is; in any case I rather prefer "AI is only doing Y", where Y is a more boring and easily definable thing that nonetheless explains what we're seeing.
> Exactly, and that's earth-shaking because of the potential it has to illuminate the connection between brains and minds.
Ah! Now there we agree entirely. Actually I think a far more consequential question than "what do LLMs have that makes them so good?" is "what don't we have that we thought we did?".... but perhaps that's because I'm an introspecting meat bag and therefore selfishly fascinated by how and why meat bags introspect.
One question, if anyone knows the details: does this prove that there exists a single LLM that can approximate any function to arbitrary precision given enough CoT, or does it prove that for every function, there exists a Transformer that fits those criteria?
That is, does this prove that a single LLM can solve any problem, or that for any problem, we can find an LLM that solves it?
If it's possible to find an LLM for any given problem, then find an LLM for the problem "find an LLM for the problem and then evaluate it" and then evaluate it, and then you have an LLM that can solve any problem.
It's the "Universal Turing Machine" for LLMs.
I wonder what's the LLM equivalent of the halting problem?
A closer analogy is the Hutter Search (http://hutter1.net/ai/pfastprg.pdf), as it is also an algorithm that can solve any problem. And this construction is probably too inefficient to use in practice, just like the Hutter Search.
Theoretical results exist that try to quantify the number of CoT tokens needed to reach different levels of computational expressibility:
https://arxiv.org/pdf/2310.07923
TL;DR: Getting to Turing completeness can require polynomial CoT tokens, wrt the input problem size.
For a field that constantly harps on parallelism and compute efficiency, this requirement seems prohibitive.
We really need to get away from constant depth architectures.
> Getting to Turing completeness can require polynomial CoT tokens, wrt the input problem size.
So, as stated, this is impossible since it violates the Time Hierarchy Theorem.
The actual result of the paper is that any poly-time computable function can be computed with poly-many tokens. Which is... not a particularly impressive bound? Any non-trivial fixed neural network can, for instance, compute the NAND of two inputs. And any polynomial computable function can be computed with a polynomial number of NAND gates.
To be clear I think the tweet is a bit exaggerated (and the word ‘performance’ there doesn’t take into account efficiency, for example) but I don’t have the time to read the full paper (just skimmed the abstract and conclusion). I quoted the tweet by an author for people to discuss since it’s still a fairly remarkable result.
This is an accepted ICLR paper by authors from Stanford, Toyota and Google. That's not a guarantee for anything, of course, but they likely know basic algorithms and the second law.
You can certainly argue against their claims, but you need to put in the legwork.
> We have mathematically proven that transformers can solve any problem, provided they are allowed to generate as many intermediate reasoning tokens as needed.
That seems like a bit of a leap here to make this seem more impressive than it is (IMO). You can say the same thing about humans, provided they are allowed to think across as many years/generations as needed.
Wake me up when an LLM figures out stable fusion or room temperature superconductors.
I think you're misrepresenting the study. It builds upon previous work that examines the computation power of the transformer architecture from a circuit-complexity perspective. Previous work showed that the class of problems that a "naive" Transformer architecture could compute was within TC0 [1, 2] and as a consequence it was fundamentally impossible for transformers to solve certain classes of mathematical problems. This study actually provides a more realistic bound of AC0 (by analyzing the finite-precision case) which rules out even more problems, including such 'simple' ones as modular parity.
We also had previous work that hinted that part of the reason why chain-of-thought works from a theoretical perspective is that it literally allows the model to perform types of computations it could not under the more limited setting (in the same way jumping from FSMs to pushdown automata allows you to solve new types of problems) [3].
Generally, literature on the computational power of the SAME neural architecture can differ on their conclusions based on their premises. Assuming finite precision will give a more restrictive result, and assuming arbitrary precision can give you Turing completeness.
From a quick skim this seems like it's making finite precision assumptions? Which doesn't actually tighten previous bounds, it just makes different starting assumptions.
You can't really be blamed though, the language in the paper does seem to state what you originally said. Might be a matter of taste but I don't think it's quite accurate.
The prior work they referenced actually did consider the finite precision case, and explained why they didn't think it was useful to prove the result under those premises.
In this work they simply argued from their own perspective why finite precision made more sense.
The whole sub-field is kinda messy and I get quoted differing results all the time.
Edit: Also, your original point stands, obviously. Sorry for nitpicking on your post, but I also just thought people should know more about the nuances of this stuff.
Density has continued to increase, but so have prices.
The 'law' was tied to the price to density ratio, and it's been almost a decade now since it died.
if you take reproduction into account and ignore all the related externalities you can definitely double your count of transistors (humans) every two years.
> You can say the same thing about humans, provided they are allowed to think across as many years/generations as needed.
Isn’t this a good thing since compute can be scaled so that the LLM can do generations of human thinking in a much shorter amount of time?
Say humans can solve quantum gravity in 100 years of thinking by 10,000 really smart people. If one AGI is equal to 1 really smart person. Scale enough compute for 1 million AGI and we can solve quantum gravity in a year.
The major assumption here is that transformers can indeed solve every problem humans can.
> Isn’t this a good thing since compute can be scaled so that the LLM can do generations of human thinking in a much shorter amount of time?
But it can't. There isn't enough planet.
> The major assumption here is that transformers can indeed solve every problem humans can.
No, the major assumptions are (a) that ChatGPT can, and (b) that we can reduce the resource requirements by many orders of magnitude. The former assumption is highly-dubious, and the latter is plainly false.
Transformers are capable of representing any algorithm, if they're allowed to be large enough and run large enough. That doesn't give them any special algorithm-finding ability, and finding the correct algorithms is the hard part of the problem!
Are we talking about "an AGI", or are we talking about overfitting large transformer models with human-written corpora and scaling up the result?
"An AGI"? I have no idea what that algorithm might look like. I do know that we can cover the majority of cases with not too much effort, so it all depends on the characteristics of that long tail.
> Combining Wu’s method with the classic synthetic methods of deductive databases and angle, ratio, and distance chasing solves 21 out of 30 problems by just using a CPU-only laptop with a time limit of 5 minutes per problem.
AlphaGeometry had an entire supercomputer cluster, and dozens of hours. GOFAI approaches have a laptop and five minutes. Scale that inconceivable inefficiency up to AGI, and the total power output of the sun may not be enough.
It's always a hindsight declaration though. Currently we can only say that Intel has reused the same architecture several times already, cranking up the voltage until it breaks, because they seem yet to find the next design leap, while AMD has been toying around with 3D placement but their latest design is woefully unimpressive. We do not know when the next compute leap will happen until it happens.
> Scale enough compute for 1 million AGI and we can solve quantum gravity in a year.
That is wrong; it misses the point. We learn from the environment, we don't secrete quantum gravity from our pure brains. It's an RL setting of exploration and exploitation, a search process in the space of ideas based on validation in reality. An LLM alone is like a human locked away in a cell, with no access to test ideas.
If you take child Einstein and put him on a remote island, and come back 30 years later, do you think he would impress you with his deep insights? It's not the brain alone that made Einstein so smart. It's also his environment that had a major contribution.
if you told child Einstein that light travels at a constant speed in all inertial frames and taught him algebra, then yes, he would come up with special relativity.
in general, an AGI might want to perform experiments to guide its exploration, but it's possible that the hypotheses that it would want to check have already been probed/constrained sufficiently. which is to say, a theoretical physicist might still stumble upon the right theory without further experiments.
Labeling observations with something better than a list of column-label strings at the top would make it possible to mine for insights in, or produce, a universal theory that covers what has been observed instead of the presumed limits of theory.
CSVW is CSV on the Web as Linked Data.
With 7 metadata header rows at the top, a CSV could be converted to CSVW, with URIs for units like metre or meter or feet.
If a ScholarlyArticle publisher does not indicate that a given CSV or better :Dataset that is :partOf an article is a :premiseTo the presented argument, a human grad student or an LLM needs to identify the links or textual citations to the dataset CSV(s).
Easy: Identify all of the pandas.read_csv() calls in a notebook,
Expensive: Find the citation in a PDF, search for the text in "quotation marks" and try and guess which search result contains the dataset premise to an article;
Or, identify each premise in the article, pull the primary datasets, and run an unbiased automl report to identify linear and nonlinear variance relations and test the data dredged causal chart before or after manually reading an abstract.
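For the "easy" case, a minimal sketch of what that scan could look like, assuming plain .ipynb files and read_csv calls with literal arguments (dynamically built paths would need AST-level analysis instead):

    import json, re

    def read_csv_calls(notebook_path):
        # Pull every pandas.read_csv(...) call out of a notebook's code cells.
        nb = json.load(open(notebook_path, encoding="utf-8"))
        pattern = re.compile(r"(?:pd|pandas)\.read_csv\(([^)]*)\)")
        calls = []
        for cell in nb.get("cells", []):
            if cell.get("cell_type") == "code":
                calls += pattern.findall("".join(cell.get("source", [])))
        return calls  # raw argument strings, e.g. '"data/measurements.csv"'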
The assumption is that the AGI can solve any problem humans can - including learning from the environment, if that is what is needed.
But I think you're missing the point of my post. I don't want to devolve this topic into yet another argument centered around "but AI can't be AGI or can't do what humans can do because so and so".
I often see this misconception that compute alone will lead us to surpass human level. No doubt it is inspired by the "scaling laws" we heard so much about. People forget that imitation is not sufficient to surpass human level.
Sort of like a quantum superposition state? So here is an idea: use quantum computing to produce all possible inferences, and some not-yet-invented algorithm to collapse them to the final result.
Has it been publicly benchmarked yet whether this approach:

    Hello LLM, please solve this task: <task>

can be improved by performing this afterwards?

    for iteration in range(10):
        Hello LLM, please solve this task: <task>
        Here is a possible solution: <last_reply>
        Please look at it and see if you can improve it.
        Then tell me your improved solution.
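For anyone who wants to try it, here is a minimal runnable version of that loop, assuming the OpenAI Python client (>=1.0) and an API key in the environment; the model name and the 10 iterations are arbitrary choices on my part, not a benchmarked recipe:

    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    def ask(prompt):
        resp = client.chat.completions.create(
            model="gpt-4o-mini",  # example model, swap in whatever you use
            messages=[{"role": "user", "content": prompt}],
        )
        return resp.choices[0].message.content

    task = "Write a function that checks whether a string is a palindrome."
    reply = ask(f"Hello LLM, please solve this task: {task}")
    for iteration in range(10):
        reply = ask(
            f"Hello LLM, please solve this task: {task}\n"
            f"Here is a possible solution: {reply}\n"
            "Please look at it and see if you can improve it. "
            "Then tell me your improved solution."
        )
    print(reply)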
I think o1 is more like "pretend you're doing a job interview, think step by step and show your working".
I tried something similar to the suggested iterative loop on a blog post I'd authored but wanted help copy editing; the first few passes were good enough, but then it got very confused and decided the blog post wasn't actually a blog post to be edited, and instead that what I really wanted to know was the implications of Florida something something Republican Party.
Benchmark would be neat, because all I have is an anecdote.
It's not clear to me what you're saying; isn't the whole deal here that by performing RL on the CoT (given sufficient size and compute) it would converge to the right program?
1) The theoretical notion that a fixed-depth transformer + COT can solve arbitrary problems involving sequential computation is rather like similar theoretical notions of a Turing machine as a universal computer, or of an ANN with a hidden layer able to represent arbitrary functions... it may be true, but at the same time not useful
2) The Turing machine, just as the LLM+COT, is only as useful as the program it is running. If the LLM+COT is incapable of runtime learning and is just trying to mimic some reasoning heuristics, then that is going to limit its function, even if theoretically such an "architecture" could do more if only it were running a universal AGI program
Using RL to encourage the LLM to predict continuations according to some set of reasoning heuristics is what it is. It's not going to make the model follow any specific reasoning logic, but is presumably hoped to generate a variety of continuations that the COT "search" will be able to utilize to arrive at a better response than it otherwise would have done. More of an incremental improvement (as reflected in the benchmark scores it achieves) than "converging to the right program".
Sometimes reading hackernews makes me want to slam my head on the table repeatedly. Given sufficient size and compute is one of the most load bearing phrases I've ever seen.
But it is load bearing. I mean, I personally can't stop being amazed at how with each year that passes, things that were unimaginable with all the world's technology a decade ago are becoming straightforward to run on a reasonably priced laptop. And at this stage, I wouldn't bet even $100 against any particular computational problem being solved in some FAANG datacenter by the end of the decade.
Technology advances, but it doesn't invent itself.
CPUs didn't magically get faster by people scaling them up - they got faster by evolving the design to support things like multi-level caches, out-of-order execution and branch prediction.
Perhaps time fixes everything, but scale alone does not. It'll take time for people to design new ANN architectures capable of supporting AGI.
> We have mathematically proven that transformers can solve any problem, provided they are allowed to generate as many intermediate reasoning tokens as needed.
A reply in this twitter thread links to a detailed blog post titled "Universal computation by attention: Running cellular automata and other programs on Claude 3 Opus." https://x.com/ctjlewis/status/1786948443472339247
Can any of these tools do anything that GitHub Copilot cannot do? (Apart from using other models?) I tried Continue.dev and cursor.ai, but it was not immediately obvious to me. Maybe I am missing something workflow specific?
No. "inherently serial" refers to problems that are specified serially and can't be spend up by parallel processing. The sum of a set of N numbers is an example of a problem that is not inherently serial. You can use parallel reduction to perform the computation in O(log(N)) time on an idealized parallel computer but it takes O(N) time on an idealized serial computer.
And, it turns out, exactly which problems really are inherently serial is a somewhat challenging problem.
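A toy sketch of the parallel-reduction point, just for illustration: each round below adds disjoint pairs, so on an idealized parallel machine all additions in a round happen at once and the whole sum takes a logarithmic number of rounds.

    def parallel_sum_rounds(xs):
        # The list comprehension runs sequentially here, but every addition in a
        # round is independent and could run in parallel on an idealized machine.
        xs = list(xs)
        rounds = 0
        while len(xs) > 1:
            reduced = [xs[i] + xs[i + 1] for i in range(0, len(xs) - 1, 2)]
            if len(xs) % 2:              # an odd element carries over unchanged
                reduced.append(xs[-1])
            xs = reduced
            rounds += 1
        return xs[0], rounds

    print(parallel_sum_rounds(range(1, 9)))  # (36, 3): 8 numbers -> 3 rounds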
They didn't say floats, and the sum of a set of floats is not uniquely defined as a float for the reason you stated, at least not without specifying a rounding mode. Most people use "round to whatever my naïve code happens to do", which has many correct answers. To add up a set of floats with only the usual 0.5 ULP imprecision - yes, that isn't trivial.
Using hardware floating point types is not suitable if mathematical correctness matters, and is largely a deprecated practice. Check out Python's fractions module, for example, for exact arithmetic[0].
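A small illustration of the difference, for the curious (math.fsum is thrown in as the middle ground: still a float result, but correctly rounded):

    from fractions import Fraction
    import math

    floats = [0.1] * 10
    print(sum(floats))        # 0.9999999999999999 - naive float accumulation
    print(math.fsum(floats))  # 1.0 - correctly rounded float sum
    print(sum(Fraction(1, 10) for _ in range(10)))  # 1 - exact rational arithmetic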
You can model parallel computation by an arbitrary finite product of Turing machines. And then, yes, you can simulate that product on a single Turing machine. I think that's the sort of thing you have in mind?
But I'm not aware of what "inherently serial" means. The right idea likely involves talking about complexity classes. E.g. how efficiently does a single Turing machine simulate a product of Turing machines? An inherently serial computation would then be something like a problem where the simulation is significantly slower than running the machines in parallel.
Yeah, it's talking about a new feature for LLMs where the output of an LLM is fed back in as input, again and again and again, and this produces way more accurate output.
No worries! With the magic bananas and ink you've acquired, those monkeys will surely produce output with a signal-to-noise ratio rivaling the best LLMs.
I’m sure your startup will achieve the coveted Apeicorn status soon!