I built a few multi-agent systems and went down a rabbit hole where I reached an important conclusion: from the perspective of the LLM, the prompt/context is the only thing that ever matters. Everything about how your agent will behave ultimately boils down to this.
I had a bunch of fancy stuff like agents collaborating by passing messages and interpreting them with their own prompts and function calls. Then I realized I could collapse all of my "agents" into one dynamic prompt that tracks state in a stupid simple text region. Passing messages around was playing in very expensive traffic at the end of the day.
This is ultimately about information, and spinning up an entire matrix of "agents" to process a stream of info from A to B seems quite wasteful when many clear alternatives exist.
If we are seeking emergence, then perhaps this mental model still fits better. But, for practical targeted solutions, I think it's a huge distraction.
The best analogy I can think of is the method Leonard uses in the film "Memento". If you want your agent to accomplish something without persisting a long chat history, and instead use the agent to reorganize and rewrite the prompt, you're doing what Leonard does. Due to his condition, Leonard cannot form new memories and struggles to recall events that occur after his injury. He knows his condition and uses tattoos to record facts "between sessions". Each chat completion of the LLM is similar to "each session" Leonard experiences. The prompt, maintained by the agent, persists across LLM chat completions the way his tattoos, notes, and photos do.
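A minimal sketch of this "Memento" pattern, assuming nothing about any particular LLM API: instead of an ever-growing chat history, the agent keeps a small rewritable state region that is re-rendered into a fresh prompt for every completion. The class and field names here are illustrative, not from the project.

```python
SYSTEM_INSTRUCTIONS = "You are a task agent. Facts you must remember are listed under FACTS."

class PromptState:
    """Holds the rewritable 'tattoo' region that persists across completions."""

    def __init__(self):
        self.facts = []  # durable notes, like Leonard's tattoos

    def record(self, fact):
        """Persist a fact 'between sessions' (i.e. between LLM completions)."""
        self.facts.append(fact)

    def render(self, task):
        """Build the entire prompt fresh for each completion."""
        facts = "\n".join(f"- {f}" for f in self.facts) or "- (none yet)"
        return f"{SYSTEM_INSTRUCTIONS}\n\nFACTS:\n{facts}\n\nTASK: {task}"

state = PromptState()
state.record("user prefers JSON output")
state.record("step 1 already completed")
prompt = state.render("do step 2")
```

Each completion sees the whole state at once, so there is no message-passing cost between "agents"; the agent's only job between calls is deciding what to record or rewrite.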
That's true if you're passing messages between identical models. There's a question to ask as to whether different models trained for different tasks would be better than single, multipurpose models though. My gut feel is that eventually multipurpose models will win because you don't have the embedded cost of relearning what syntactic structure is, but for a given training time and number of weights it's not clear whether that's true today.
Yeah, same principle. If you're passing messages between things that will react exactly the same to the same prompt, there's not a lot of point (unless the parallelism is important). If you've got fine-tunes, the whole point is that they will be better at some questions than the baseline.
Mind you, there's another idea there about mixture of experts as implemented by deciding which fine-tune to load, depending on the prompt itself... I'm sure that's been looked at.
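The "pick which fine-tune to load based on the prompt" idea can be sketched with a toy router. Everything here is a stand-in: the model names and the keyword heuristic are hypothetical, and a real router would likely be a learned classifier rather than keyword matching.

```python
# Hypothetical mapping from task category to fine-tuned checkpoint.
FINETUNES = {
    "code": "base-model-ft-code",
    "math": "base-model-ft-math",
    "general": "base-model",
}

def route(prompt):
    """Pick which fine-tune to load for this prompt (toy keyword heuristic)."""
    lowered = prompt.lower()
    if any(k in lowered for k in ("def ", "function", "compile")):
        return FINETUNES["code"]
    if any(k in lowered for k in ("integral", "prove", "equation")):
        return FINETUNES["math"]
    return FINETUNES["general"]
```

The interesting part is exactly the question raised above: whether the routing decision itself should come from the prompt, from a small classifier, or from the base model.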
Maybe the LLM needs something like Dante's Divina Commedia: previous instances describing, in condensed form, why a previous prompt's conclusions failed, so it can successfully navigate around the trained-in local minima. A diary of failures to keep track of how to reach success.
We do something similar with Magic Loops[0], but within the context of generating a single "loop" (automation).
We've found that LLMs are pretty bad at prompting other LLMs, unless the problem at hand is very limited in scope. It's too easy to get incorrect/expensive behavior otherwise (e.g. starts building a framework against an imaginary API, instead of using an existing tool).
Our approach looks more like a state machine under the hood, mostly code with LLM-based "magic" sprinkled throughout.
The tool can edit both code and LLM "blocks" as it sees fit, allowing it to change its functionality and prompting dynamically.
Interestingly, we first set the validate->fix threshold to N=3, but the "agent" often gets stuck in a pattern of retries when the user input is low quality.
[0] Feel free to give our tool a try, it's very much in an alpha state: https://magicloops.dev/
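The validate->fix loop with a retry cap (the N=3 threshold mentioned above) can be sketched as plain control flow. `generate` and `validate` stand in for the real LLM call and checker, which I'm not claiming to know; the toy functions below just exercise the loop.

```python
def run_with_fixes(generate, validate, max_attempts=3):
    """Try generate -> validate, feeding failure feedback back, up to max_attempts."""
    feedback = None
    for attempt in range(max_attempts):
        result = generate(feedback)
        ok, feedback = validate(result)
        if ok:
            return result
    return None  # give up rather than retry forever on bad input

# Toy demo: the fake generator succeeds on the second attempt.
attempts = []
def fake_generate(feedback):
    attempts.append(feedback)
    return "v2" if feedback else "v1"

def fake_validate(result):
    return (result == "v2", "needs fixing")

out = run_with_fixes(fake_generate, fake_validate)
```

The cap is doing the real work: without it, low-quality input can keep the generate/validate pair oscillating indefinitely, which matches the "stuck in a pattern of retries" failure mode described above.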
That's what keeps LLMs from being self-sufficient it seems. Life has this peculiar ability to always keep basic necessities at the center, and always default to them, so the loop can keep going:
1. Try your hardest not to die.
2. Try your hardest to reproduce.
3. Try to thrive maybe also while at it, but DON'T FORGET 1 AND 2!
This is why humans are this weird combination of instinct and reasoning that we find hard to control. The limbic system is very basic, but very strong and very stable. The primary loop. Everything else allows us to go farther, but when we fail we go back to relying on the limbic system.
If we allowed reason to take over instinct, we'd likely end ourselves. Not that we can't end ourselves by relying on instinct too, but essentially the strategy there is to sacrifice the intelligent shell and let the basic loop continue, so we can rebuild the intelligent shell in a new form, more stable and better suited to the environment.
One interesting detail in the Code As Policies paper from last year was that they generate code from the given API, then recursively get the LLM to implement any functions from what it generates that don't already exist. I thought that was quite neat.
I've started seeing those sort of hallucinated API calls as a signal for something that ought to exist; if it's predictable enough for an LLM to think it might, then maybe it should, to make it easier for humans too?
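The recursive trick from Code as Policies can be sketched with Python's `ast` module: parse the generated code, find called names that don't exist yet, and ask the LLM to implement those too, recursing until nothing is undefined. `llm_implement` is a placeholder for the actual LLM call; this is a sketch of the idea, not the paper's implementation.

```python
import ast

def undefined_calls(source, known):
    """Return names called in `source` that are neither defined there nor in `known`."""
    tree = ast.parse(source)
    called = {n.func.id for n in ast.walk(tree)
              if isinstance(n, ast.Call) and isinstance(n.func, ast.Name)}
    defined = {n.name for n in ast.walk(tree) if isinstance(n, ast.FunctionDef)}
    return called - defined - known

def expand(source, known, llm_implement):
    """Recursively ask the LLM to fill in any functions that don't exist yet."""
    for name in sorted(undefined_calls(source, known)):
        body = llm_implement(name)  # LLM writes the missing function
        source += "\n" + expand(body, known, llm_implement)
    return source

# Demo with canned "LLM" outputs: task() calls move_to(), which calls set_pose().
code = "def task():\n    move_to(1, 2)\n"
stubs = {
    "move_to": "def move_to(x, y):\n    set_pose(x, y)\n",
    "set_pose": "def set_pose(x, y):\n    pass\n",
}
full = expand(code, {"print"}, lambda name: stubs[name])
```

This also captures the "hallucinated API as a feature request" point: every name the model invents becomes a prompt for the next round of generation instead of an error.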
> starts building a framework against an imaginary API, instead of using an existing tool
If you swap “API” for “business need” you’ve got a basic software developer. Most of the time those imaginary business needs are a waste of time, but every once in a while the solution is pure genius, answering a problem no one realised they had. Mostly the first, though.
Here I've created a basic Paint program with no pens, brushes or other drawing tools. Instead, you get the AI to create the painting tools: it writes, tests and deploys the JavaScript live, usually within 30 seconds.
It only uses OpenAI's gpt-3.5-turbo model too which is fast and good enough for this use case.
It is "experimental". The AI sometimes produces code that doesn't fully work (this is part of the fun). Refresh and try again. It works 90% of the time I would say.
Awesome to see. We are productionizing a variant of this in louie.ai and it's the biggest step function in system quality we've seen in ~months. This kind of thing enables continuous learning & user-directed learning without having to formally fine-tune a model - magically instant from the user perspective -- and fits cleanly within a RAG system in general.
Of course, once you rephrase this as a learning problem, a lot of new questions pop up on what it means to do it right :)
Is this anywhere close to creating a microagent that takes you (the author, "aymenfurter") out of the equation for this repo itself? As in, making yourself redundant and letting the bot be in charge of this repo and all that entails? How far away in time is that?
That's not a fitting analogy because there is no equivalent to the laws of thermodynamics that says a system can't be self improving. You're dismissing a technological possibility without much reason.
We are extremely similar genetically to our ancestors of 100k years ago. The big difference is cultural inheritance. We come up with ideas and objects and pass them down. Humans have improved the capabilities of humans. There's no reason to think computer programs aren't capable of the same.
It’s not that straightforward. Quoting Yann LeCun from a LinkedIn post [1]:
> I have claimed that Auto-Regressive LLMs are exponentially diverging diffusion processes.
> Here is the argument:
> Let e be the probability that any generated token exits the tree of "correct" answers.
> Then the probability that an answer of length n is correct is (1-e)^n.
> Errors accumulate. The probability of correctness decreases exponentially.
> One can mitigate the problem by making e smaller (through training) but one simply cannot eliminate the problem entirely.
> A solution would require to make LLMs non auto-regressive while preserving their fluency.
Moreover, I think the human analogy doesn’t quite fit here. There’s no evidence that today’s human is cognitively more capable than their human ancestors. It’s a conflation of technology and transference of knowledge with intelligence.
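For intuition, the quoted (1-e)^n claim is easy to check numerically; the specific values of e and n below are just illustrative, not from LeCun's post.

```python
def p_correct(e, n):
    """Probability an n-token answer never exits the 'correct' tree,
    given per-token exit probability e (LeCun's model)."""
    return (1 - e) ** n

# Even a small per-token error rate compounds quickly over long answers.
short = p_correct(0.01, 10)    # roughly 0.90
long_ = p_correct(0.01, 1000)  # roughly 4e-5
```

This is the sense in which errors "accumulate": the decay is exponential in answer length, which is exactly the assumption the reply below takes issue with.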
I guess I should be wiser than contradicting LeCun on a public forum, but his math doesn't really work out. It only works if there is a unique correct answer to any question, in which case e=1/dict_size which is clearly false - even humans can just put "Let me think..." in front of anything else and get logically "the same" answer; after how many such filler words do you deem it "wrong"? Even taking colloquial speech aside, in Math, there is apparently a book that collects 367 different proofs of the Pythagorean Theorem; that tree of correct answers is certainly quite complicated. You can't approximate it with a fixed probability and take the exponential - the number of possibly correct next tokens varies very strongly depending on the previous string.
LLMs and autoregression are very good at avoiding the 99.999[...]% of the strings that are simple gibberish. If you were to generate a string made of 20 random tokens from GPT2 tokenizer, you would get something like:
and of course any half decent language model does much better than that. If the "paths to truth" were as unlikely as LeCun puts it there would be no hope.
A non-autoregressive model would certainly be "better" because it would be faster, which is where language models started from (BERT & co.), it just doesn't seem to work as well... similarly to how a human sometimes needs to write something down and only realizes the correct answer to a complicated question on the go.
If anything, we'd need to allow LLMs to realize mistakes and correct themselves out of them, i.e. make the generation non-linear. If you ask GPT4 something complicated (like math) it's not rare at all that it logically contradicts itself in its answers. I would be surprised if, somewhere deep in the model, it doesn't "realize" this, but it can't fix it, so it falls back to what humans do at an exam or interview that started badly: try to bullshit their way out of the thing, sweeping the inconsistency under the carpet, unless you explicitly point it out to them (and often even after that, both GPT4 and humans).
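The "let the model catch its own mistakes" idea can be sketched as a generate/critique/revise loop. `ask_llm` is a hypothetical single-completion call (not any real API), and the "OK" sentinel is an assumption for illustration; real critique protocols would need something more robust.

```python
def answer_with_self_check(ask_llm, question, max_rounds=2):
    """Generate, critique, and optionally revise: a non-linear generation loop."""
    answer = ask_llm(f"Question: {question}\nAnswer:")
    for _ in range(max_rounds):
        critique = ask_llm(f"Check this answer for contradictions:\n{answer}")
        if "OK" in critique:  # critic found nothing wrong
            return answer
        answer = ask_llm(f"Revise the answer. Problem found: {critique}\n"
                         f"Question: {question}\nAnswer:")
    return answer

# Demo with a scripted fake model: first answer is wrong, revision passes.
script = iter(["2+2=5", "contradicts arithmetic", "2+2=4", "OK"])
out = answer_with_self_check(lambda p: next(script), "What is 2+2?")
```

The point is that the revision step gets to see the contradiction explicitly, instead of the model having to sweep it under the carpet mid-generation.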
P.S. Mathematician rant: who on Earth calls a probability "e"??
> There’s no evidence that today’s human is cognitively more capable than their human ancestors. It’s a conflation of technology and transference of knowledge with intelligence.
That was their point. See here:
> We are extremely similar genetically to our ancestors of 100k years ago. The big difference is cultural inheritance. We come up with ideas and objects and pass them down.
Yup, hence the “unfit” analogy. It doesn’t correspond well with LLMs getting more intelligent in the same way, which I believe, is intended by having a “self editing agent”.
In the linked project, the LLM edits its prompt. I think this is similar enough to humans improving our capabilities through ideas (such as language) that it counters the complete dismissal of LLMs improving themselves, as if that violated the laws of thermodynamics.
Now you're falsely equating biological brains with giant arrays of floating point numbers.
we have NOT evolved to where we are today by generating inputs to ourselves, watching our own output, and modifying our inputs to watch our outputs again.
Rather, we had fundamental, strong, almost flawless, rigorous (formally documented) reasoning capabilities from the get-go: inference, induction, deduction. LLMs have none of that, and it's documented all over, multiple times over.
We then built on top of those faculties, not by feeding input to ourselves and observing our outputs, but by poking around in the world with our reasoning and cognitive faculties and carefully arranging/categorising what we discovered.
So for fanboys, this surely is very huge and I respect that sentiment wholeheartedly.
> Rather - we had fundamental strong reasoning almost flawless rigorous (formally documented) capabilities from the get go.
What are you talking about? This doesn't describe the vast majority of human knowledge.
You're dismissing the possibility of new technology via a real loose analogy and insulting anyone who disagrees. This isn't what good reasoning looks like.
GANs already exist. The linked project already exists.
> A model will generate its own input and will watch it's own output and in process, will become more intelligent than it really is.
If you sub out the needless contradiction of "more intelligent than it really is" for "more intelligent than it was", then this is something that already happens. It will continue to happen whether you believe in it or not.
You're acting like you're arguing against people with sketches of perpetual motion devices. You're not. This is like arguing against heavier-than-air flight in 1905. You are way behind.
> What are you talking about? This doesn't describe the vast majority of human knowledge.
I hope you don't mean to say that the vast majority of knowledge is devoid of any consistent reasoning and is just hallucinated along the way, as an LLM does.
> It will continue to happen whether you believe in it or not.
We'll see. I believe post transformers, next AI winter is around the corner - for a while.
What I see above is an LLM chewing its own output with light modifications as input and that's not going to lead anywhere as the README of the project itself clearly notes. It gets stuck.
> we have NOT evolved where we are today by generating inputs to ourselves
Sure, we have not evolved this way in the strictly biological sense, but we did greatly extend our capabilities this way. The jump in capabilities in the past tens of thousands of years is mostly from passing knowledge around and incrementally building on top of it. Often by building higher level abstractions, and building more complex systems on top of those, or by dividing the problem into distinct parts, and having people specialize.
Small nitpick: I think you mean the second law of thermodynamics.
Also the second law of thermodynamics is an empirical law. It is based on observations and not proven.
To say that a perpetual motion machine can't exist because of the 2nd law is circular logic, since the 2nd law was established precisely because a perpetual motion machine was never observed.
That's from the POV of thermodynamics, though; with statistical mechanics you can show that the probability of a process defying this law goes to zero.
An AI tried to tell me that {} opens a new block scope in PHP and that any variables in it are scoped to that block. I nearly lol’d, and the code it gave me was so wrong it was cringey. Hopefully it is better at Python.
Google and many people will tell you the same thing. So you must have a lot of lol moments in a day.
Gpt4 is quite good (better than most humans I ever met) at php, but bad at facts; don’t ask it facts, ask it to write code. That’s what you would ask a human (outside interviews).
Disclaimer: I am formally trained with proofs and proof assistants, and I hate the current timeline where we ask AI to drivel up code, but I cannot say it’s bad at doing it; it’s just not necessarily sound or even working code, sure, but that’s the same as with most human first tries. Then you iterate and make it better. My days of ‘I have proven it correct, now I just have to type it in’ are long gone, at least for things that pay for my bread.
When saying LLMs are "good at writing code" there's a distinction between
"good at taking high-level English-language solutions to programming problems and translating them into low-level implementations via a specific language / design patterns / etc"
versus
"good at finding solutions to programming problems."
GPT-4 is indeed quite good at the former - and the former is what most enterprise programming work actually is. A lot of the hard part of professional software development is understanding the problem well enough to describe a solution in English: once you do that, writing the C# or Java is typically somewhat rote. Likewise LLMs are genuinely useful when you know exactly what you want to do with a 3rd-party library, but have to trawl through a bunch of API documentation to figure out the magic words.
All that said, LLMs still really suck at the latter problem: https://www.aisnakeoil.com/p/gpt-4-and-professional-benchmar... OpenAI's benchmarks fall off quite badly with actual programming benchmarks, compared to simpler tests of code generation. If it involves managing state, creating novel data structures, counting to numbers higher than 3, etc, LLMs just aren't smart enough.
You are right, hence my disclaimer. For paid work, it works fine and that is indeed enterprise kind of mindnumbing plumbing. For other work, I do not use it (but will try every few months if anything changed of course).
I do! I use code generation for non-work-related things, then spend time refactoring it and cleaning up the mess.
AI is pretty decent but lacks the ability to write "maintainable" software, which is easy to extend, and replace. It works as a fantastic starting point in that it fills in all the boilerplate code.
That's a separate project called ReplicatorAgent. Unfortunately the Stargate program has had a number of run ins with them and the Air Force is merciless in shutting those kinds of projects down.
lol posted the same thing above. I’m glad I’m not the only one who thought that was an extremely powerful read! As someone currently trying to imbue silicon with an eternal soul, it’s very sobering.
AGI feels like a marketing term to me now. I mainly see it as research into how we can improve and scale the current model architectures to be better than the last one.
Eh, regardless of all that, the answer is just “yes” imo: self-improvement is a necessary step to meaningful superintelligence. Perhaps not to AGI, but it obviously seems like a massive help, if not strictly necessary.
In a way, that’s what modern multi-stage-trained foundational models already do: improve their weights intelligently. Having the same results in human readable code (what this is a first step towards) would be a lot more powerful…