"Research shows that AI systems with 30+ agents out-performs a simple LLM call in practically any task (see More Agents Is All You Need), reducing hallucinations and improving accuracy."
Has anyone heard of that actually playing out practically in real-world applications? This article links to the paper about it - https://arxiv.org/abs/2402.05120 - but I've not heard from anyone who's implementing production systems successfully in that way.
(I still don't actually know what an "agent" is, to be honest. I'm pretty sure there are dozens of conflicting definitions floating around out there by now.)
I don't even buy that the linked paper justifies the claim. All the paper does is draw multiple samples from an LLM and take the majority vote. They do try integrating their majority vote algorithm with an existing multi-agent system, but it usually performs worse than just straightforwardly asking the model multiple times (see Table 3). I don't understand how the author of this article can make that claim, nor why they believe the linked paper supports it. It does, however, support my prior that LLM agents are snake oil.
Looking at Table 3 ("Our [sampling and voting] method outperforms other methods used standalone in most cases and always enhances other methods across various tasks and LLMs"): which benchmark did the majority vote algorithm perform worse on?
I'm not saying that sampling and majority voting performed worse. I'm saying that multi-agent interaction (labeled Debate and Reflection) performed worse than straightforward approaches that just query multiple times. For example, the Debate method combined with their voting mechanism gets 0.48 GSM8K with Llama2-13B. But majority voting with no multi-agent component (Table 2) gets 0.59 on the same setting. And majority voting with Chain-of-thought (Table 3 CoT/ZS-CoT) does even better.
Fundamentally, drawing multiple independent samples from an LLM is not "AI Agents". The only rows on those tables that are AI Agents are Debate and Reflection. They provide marginal improvements on only a few task/model combinations, and do so at a hefty increase in computational cost. In many tasks, they are significantly behind simpler and cheaper alternatives.
I see. Agree with the point about marginal improvements at a hefty increase in computational cost (I touch on this a bit at the end of the blog post where I mention that better performance requires better tooling/base models). Though I would still consider sampling and voting a “multi-agent” framework as it’s still performing an aggregation over multiple results.
Ensembling is okay, but it doesn't scale great. Like you can put in a bit more compute to draw more samples and get a better result, but you can't keep doing it without hitting a plateau, because eventually the model's majority vote stabilizes. So it's a perfectly reasonable thing to do, as long as we temper our expectations.
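For anyone who hasn't read the paper, the whole technique is roughly this. A minimal sketch, assuming a made-up `sample_llm()` stand-in for whatever chat-completion client you'd actually use:

    # Sampling-and-voting sketch; sample_llm() is a hypothetical placeholder.
    from collections import Counter
    import random

    def sample_llm(prompt: str) -> str:
        # Placeholder: pretend the model answers "42" about 70% of the time.
        return "42" if random.random() < 0.7 else random.choice(["41", "43"])

    def majority_vote(prompt: str, n_samples: int = 20) -> str:
        votes = Counter(sample_llm(prompt) for _ in range(n_samples))
        return votes.most_common(1)[0][0]

    # Past some n, extra samples stop changing the winner -- that's the plateau.
    for n in (1, 5, 20, 100):
        print(n, majority_vote("What is 6 * 7?", n_samples=n))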
Originally, the term "agency" was used to point out that computers could not act autonomously, intentionally, or with purpose.
Then a few developers published joke projects of automated browsers that pointlessly doomscroll random webpages, presented as "here are your agents".
Then a few companies turned this joke into "RAG" agents, except there the automated browsers don't act autonomously; they plagiarize web content according to a prompt.
Then many websites started to block RAG agents, and to hide this, companies fell back to returning data from LLMs, occasionally updating responses in the background (aka "online models").
The idea of plagiarizing content is also mixed with another idea: if an LLM rewrites plagiarized content multiple times, it becomes harder to prove.
Obviously, none of the companies involved will admit to plagiarizing. Instead, they will cover themselves with the idea that it leads to superintelligence. For example, if multiple neural networks repeat that 9.11 > 9.9, it will be considered more accurate[1].
I love this paper though I think their use of "agent" is confusing. My takeaway is ensembling LLM queries is more effective than making a single query. We used a similar approach at work and got a ~30% increase in [performance metric]. It was also cost efficient relative to the ROI increase that came with the performance gains. Simpler systems only require the number of output generations to increase, meaning input costs stay the same regardless of the size of the ensemble.
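(In case the cost point is unclear: one way to keep the prompt from being billed once per ensemble member is to request several completions in a single call. With the OpenAI Python client that's roughly the `n` parameter, as in the sketch below; treat it as an illustration, not the exact setup we ran.)

    # Sketch: one request, several sampled completions, the prompt billed once.
    # Assumes the OpenAI Python client (>= 1.0); adapt for your own provider.
    from collections import Counter
    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": "Answer with a single number: 17 * 23 = ?"}],
        n=10,             # ten sampled answers for one copy of the input tokens
        temperature=1.0,
    )
    answers = [choice.message.content.strip() for choice in resp.choices]
    print(Counter(answers).most_common(1)[0][0])  # majority vote over the samples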
Thanks for this paper! Still early so not quite production, but I’ve seen positive results on tasks scaling the number of “agents”. I more or less think of agents as a task focused prompt and these agents are slight variations of the task. It makes me think I’m running some kind of Monte Carlo simulation.
> All it is, is a bunch of LLM calls in a while loop.
Consciousness may be such a loop. Look outward, look at your response, repeat. Like the illusion of motion created by frames per second.
To me it is concerning that proper useful agents may emerge from such a simple loop. E.g just repeatedly asking a sufficiently intelligent AI, "what action within my power shall I perform next in order to maximize the mass of paperclips in the universe," interleaved by performing that act and recording the result.
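The unsettling part is how short that loop is. A toy sketch, where `ask_llm()` and `perform()` are hypothetical stand-ins rather than a real API:

    # Toy agent loop: ask for the next action, perform it, record the result.
    # ask_llm() and perform() are hypothetical placeholders, not a real API.

    def ask_llm(objective: str, history: list[str]) -> str:
        # Placeholder for a chat-completion call that sees the objective and history.
        return "acquire more wire" if len(history) < 3 else "STOP"

    def perform(action: str) -> str:
        # Placeholder for actually executing the action and observing the result.
        return f"did: {action}"

    def agent_loop(objective: str, max_steps: int = 10) -> list[str]:
        history: list[str] = []
        for _ in range(max_steps):
            action = ask_llm(objective, history)
            if action == "STOP":
                break
            history.append(perform(action))
        return history

    print(agent_loop("maximize the mass of paperclips in the universe"))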
Honestly, agree. When I first started working with agents I didn't fully understand what it really was either, but I eventually settled on a definition: an LLM call that performs a unique function proactively ¯\_(ツ)_/¯.
I spent the last year engineering to the point where I could try this, and it was ___massively___ disappointing in practice. Massively. Shockingly.
The answers from sampling every big model, then having one distill, were not noticeably better than those from GPT-4o or Claude Sonnet alone, and the UX is so much worse (2x wait) that I tabled it for now.
I assumed it would be obviously good, even just from first principles.
I didn't do my usual full med/law benchmarks: given what we saw from a small sample of only 8 questions, I can skip adding it and proceed down the TODO list for launch.
I've also done the inverse: reproduced better results on med with GPT-4o x one round of RAG x one answer than Google's Med Gemini and its insanely complex setup ("finetune our biggest model on med, then do 2 rounds of RAG with 5 answers + a vote on each round, and an opportunity to fetch new documents in round 2"). We're both near saturation, but I got 95% on my random sample of 100 Qs; Med Gemini was at 93%.
I would check out this company, Swarms (https://github.com/kyegomez/swarms), which is working with enterprises to integrate multi-agents. But that's definitely a great point to focus on: the research paper mentions that the performance scaling drops off as task complexity increases, which is definitely true for SWE.
I totally believe that people are selling solutions around this idea, what I'd like to hear is genuine success stories from people who have used them in production (and aren't currently employed by a vendor).
Midjourney uses multiple agents for determining if a prompt is appropriate or not.
I kinda did this, too.
I made a 3 agent system — one is a router that parses the request and determines where to send it (to the other 2) one is a chat agent and the third is an image generator.
If the router determines an image is requested, the chat agent is tasked with making a caption to go along with the image.
I am fairly certain that agents are the same thing as prompts (but they could also be different).
Only the chat prompt/agent/whatever is connected to RAG; the image generator is DALLE and the router is a one-off call each time.
E.g., it could be the same model with a different prompt, or a different model + different prompt altogether. AFAIU it's just serving a different purpose than the other calls.
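In code it isn't much more than this. A rough sketch of the routing part, where `llm()` and `generate_image()` are hypothetical placeholders for the real chat and DALL-E calls, and the RAG retrieval is left out:

    # Router -> (chat agent | image generator) sketch.
    # llm() and generate_image() are hypothetical placeholders for real API calls.

    def llm(prompt: str) -> str:
        # Placeholder for a chat-completion call.
        if prompt.startswith("Answer IMAGE or CHAT"):
            return "IMAGE" if "draw" in prompt or "picture" in prompt else "CHAT"
        return "placeholder model reply"

    def generate_image(prompt: str) -> str:
        # Placeholder for the DALL-E call; the real one returns an image URL.
        return f"<image for: {prompt}>"

    def handle(user_request: str) -> str:
        decision = llm(f"Answer IMAGE or CHAT. Does this request ask for an image?\n{user_request}")
        if decision.strip().upper().startswith("IMAGE"):
            image = generate_image(user_request)
            caption = llm(f"Write a short caption for an image matching: {user_request}")
            return f"{image}\n{caption}"
        # Only this chat path is wired up to RAG in the setup described above.
        return llm(user_request)

    print(handle("draw me a cat wearing a hat"))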
Currently all the tools on the market just use a different prompt and call it an agent.
I've built a tool using different models for each agent, e.g. Whisper for audio decoding, LLaVA to detect slides on the screen, OpenCV to crop the image, OCR to read the slide content, and an LLM to summarize everything that's happening during the earnings call.
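Roughly this shape, for anyone curious; the LLaVA slide detection and the final summarizer are hypothetical stubs here, the file names are made up, and the real pipeline obviously has more glue:

    # Pipeline sketch: audio transcript + slide OCR -> summary.
    # detect_slide_region() and summarize() are hypothetical stand-ins;
    # assumes openai-whisper, opencv-python, and pytesseract are installed.
    import cv2
    import pytesseract
    import whisper

    def detect_slide_region(frame):
        # Placeholder for a LLaVA call that returns the slide bounding box.
        return (100, 50, 800, 600)  # x, y, w, h

    def summarize(transcript: str, slide_text: str) -> str:
        # Placeholder for the final LLM summarization call.
        return f"Summary over {len(transcript)} chars of speech plus slide text: {slide_text[:80]}"

    audio_model = whisper.load_model("base")
    transcript = audio_model.transcribe("earnings_call.mp3")["text"]   # audio -> text

    frame = cv2.imread("frame_0420.png")                               # one video frame
    x, y, w, h = detect_slide_region(frame)
    slide_text = pytesseract.image_to_string(frame[y:y + h, x:x + w])  # crop -> OCR

    print(summarize(transcript, slide_text))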
Hmmm I get what you mean...I think it's hard to sell a solution around this idea, but I think it will become something more like a common practice/performance improvement method. James Huckle on Linkedin (https://www.linkedin.com/feed/update/urn:li:activity:7214295...) mentioned that agent communication would be something more like a hyperparameter to tune which I agree with.
It's interesting how long the terms "agents"/"intelligent agents" have been around and how long they've been hyped. If you go back to the 80s and 90s you will see how Microsoft was hyping up "intelligent agents" in Windows, but nothing ever became of it[1].
I have yet to see an actually useful use case for agents; despite the countless posts asking for examples, nobody has provided one.
It was a common theme in Apple WWDC conferences in the late 80's. Alan Kay has an interesting talk about agents.
I think it could be argued that the low-hanging-fruit aspects of agents were fulfilled by microservices and web-based businesses. The concept of webpages populated with relevant data, like Google search pages or Amazon suggesting products you'd like, could be called agent-based. Netflix could be an example of an agent-based service.
I mean, they are truly the dream of every capitalist: intelligent agents basically mean free money. Of course this is something Microsoft would be talking about for decades.
Is there an agent framework that lives up to the hype?
Where you specify a top-level objective, it plans out those objectives, it selects a completion metric so that it knows when to finish, and iterates/reiterates over the output until completion?
> Where you specify a top-level objective, it plans out those objectives, it selects a completion metric so that it knows when to finish, and iterates/reiterates over the output until completion?
I built Plandex[1], which works roughly like this. The goal (so far) is not to take you from an initial prompt to a 100% working solution in one go, but to provide tools that help you iterate your way to a 90-95% solution. You can then fill in the gaps yourself.
I think the idea of a fully autonomous AI engineer is currently mostly hype. Making that the target is good for marketing, but in practice it leads to lots of useless tire-spinning and wasted tokens. It's not a good idea, for example, to have the LLM try to debug its own output by default. It might, on a case-by-case basis, be a good idea to feed an error back to the LLM, but just as often it will be faster for the developer to do the debugging themselves.
Thanks for the feedback. The cloud option is offered as a way to get started as quickly as possible, but self-hosting is straightforward too: https://docs.plandex.ai/hosting/self-hosting
To this date, ChatGPT Code Interpreter is still the most impressive implementation of this pattern that I've seen.
Give it a task, it writes code, runs the code, gets errors, fixes bugs, tries again generally until it succeeds.
That's over a year old at this point, and it's not clear to me if it counts as an "agent" by many people's definitions (which are often frustratingly vague).
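The pattern itself is tiny; a toy version, where `generate_code()` is a hypothetical stand-in for the model call:

    # Toy write-run-fix loop in the Code Interpreter style.
    # generate_code() is a hypothetical placeholder for a real model call.
    import traceback

    def generate_code(task: str, last_error: str | None = None) -> str:
        # Placeholder: a real version sends the task (plus any previous
        # traceback) to the model and returns Python source.
        return "result = sum(range(10))\nprint(result)"

    def solve(task: str, max_attempts: int = 3) -> None:
        last_error = None
        for _ in range(max_attempts):
            code = generate_code(task, last_error)
            try:
                exec(code, {})   # run the generated code
                return           # it ran cleanly: stop retrying
            except Exception:
                last_error = traceback.format_exc()  # feed the error back next round
        print("gave up after", max_attempts, "attempts")

    solve("sum the numbers 0..9 and print the result")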
Well, you have Cognition AI and Devin, which recently became a unicorn startup (partnerships with Microsoft and such), but true, I can't think of an agent that actually lives up to the hype (I heard Devin wasn't great).
"Agentic" is a term of art from psychology that's diffused into common usage. It dates back to the 1970s, primarily associated with Albert Bandura, the guy behind the Bobo doll experiment.
From ChatGPT: Other examples include "heuristic," "cognitive dissonance," "meta-cognition," "self-actualization," "self-efficacy," "locus of control," and "archetype."
Thanks. Interesting! I see "Agentic state" is one where an "individual perceives themselves as an agent of the authority figure and is willing to carry out their commands, even if it goes against their own moral code". That's ironic as most LLMs have such strong safety training that it's almost impossible to get them to enter such a state.
If we structured AI agents like big tech org charts, which company structures would perform better? Inspired by James Huckle's thoughts on how organizational structures impact software design, I decided to put this to the test: https://bit.ly/ai-corp-agents.
- The original SWE-bench paper only consists of solved issues, while a big part of open source is triage, follow-up, clarification, and dealing with crappy issues
- Saying "<Technique> is all you need" when you are increasing your energy usage 30-fold just to fail > 50% of the time is intellectually dishonest
Definitely not saying multi-agents is all you need for SWE-bench haha. I touch on this at the end of the blog post, where I mention jumps in progress require better base models or tooling.