"Research shows that AI systems with 30+ agents out-performs a simple LLM call in practically any task (see More Agents Is All You Need), reducing hallucinations and improving accuracy."
Has anyone heard of that actually playing out practically in real-world applications? This article links to the paper about it - https://arxiv.org/abs/2402.05120 - but I've not heard from anyone who's implementing production systems successfully in that way.
(I still don't actually know what an "agent" is, to be honest. I'm pretty sure there are dozens of conflicting definitions floating around out there by now.)
I don't even buy that the linked paper justifies the claim. All the paper does is draw multiple samples from an LLM and take the majority vote. They do try integrating their majority vote algorithm with an existing multi-agent system, but it usually performs worse than just straightforwardly asking the model multiple times (see Table 3). I don't understand how the author of this article can make that claim, nor why they believe the linked paper supports it. It does, however, support my prior that LLM agents are snake oil.
Looking at Table 3: "Our [sampling and voting] method outperforms other methods used standalone in most cases and always enhances other methods across various tasks and LLMs". Which benchmark did the majority vote algorithm perform worse on?
I'm not saying that sampling and majority voting performed worse. I'm saying that multi-agent interaction (labeled Debate and Reflection) performed worse than straightforward approaches that just query multiple times. For example, the Debate method combined with their voting mechanism gets 0.48 on GSM8K with Llama2-13B. But majority voting with no multi-agent component (Table 2) gets 0.59 in the same setting. And majority voting with chain-of-thought (Table 3 CoT/ZS-CoT) does even better.
Fundamentally, drawing multiple independent samples from an LLM is not "AI Agents". The only rows on those tables that are AI Agents are Debate and Reflection. They provide marginal improvements on only a few task/model combinations, and do so at a hefty increase in computational cost. In many tasks, they are significantly behind simpler and cheaper alternatives.
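For anyone who hasn't read the paper, here is a minimal sketch of what "draw multiple independent samples and take the majority vote" amounts to. It is not the paper's code. The `make_noisy_model` function is a made-up stand-in for an LLM call, and its answer distribution (60% correct) is invented purely for illustration.

```python
import random
from collections import Counter
from typing import Callable


def majority_vote(answers: list[str]) -> str:
    """Return the most common answer; ties go to the answer seen first."""
    return Counter(answers).most_common(1)[0][0]


def sample_and_vote(ask: Callable[[str], str], question: str, n: int) -> str:
    """Query the same model n times independently and return the majority answer."""
    return majority_vote([ask(question) for _ in range(n)])


def make_noisy_model(seed: int) -> Callable[[str], str]:
    """Hypothetical stand-in for an LLM: answers "42" 60% of the time,
    otherwise one of two wrong answers. The weights are invented."""
    rng = random.Random(seed)

    def ask(question: str) -> str:
        return rng.choices(["42", "41", "43"], weights=[0.6, 0.2, 0.2])[0]

    return ask


if __name__ == "__main__":
    ask = make_noisy_model(seed=0)
    print(sample_and_vote(ask, "What is 6 * 7?", n=25))
```

Note there is no interaction between the samples at all: no debate, no reflection, no shared state. That is the distinction being drawn above.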
I see. Agree with the point about marginal improvements at a hefty increase in computational cost (I touch on this a bit at the end of the blog post where I mention that better performance requires better tooling/base models). Though I would still consider sampling and voting a “multi-agent” framework as it’s still performing an aggregation over multiple results.
Ensembling is okay, but it doesn't scale great. Like you can put in a bit more compute to draw more samples and get a better result, but you can't keep doing it without hitting a plateau, because eventually the model's majority vote stabilizes. So it's a perfectly reasonable thing to do, as long as we temper our expectations.
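A rough way to see the plateau, under simplifying assumptions that are mine and not from the paper: each question has only two possible answers, samples are independent, and the benchmark below is invented. With enough samples the vote converges to whatever answer the model favors on each question, so ensemble accuracy tends toward the fraction of questions where the favored answer is the right one.

```python
from math import comb


def p_majority_correct(p: float, n: int) -> float:
    """Probability that more than half of n independent samples are correct,
    when each sample is correct with probability p (use odd n to avoid ties)."""
    return sum(
        comb(n, k) * p**k * (1 - p) ** (n - k)
        for k in range(n // 2 + 1, n + 1)
    )


def ensemble_accuracy(per_question_p: list[float], n: int) -> float:
    """Average majority-vote accuracy over a set of questions."""
    return sum(p_majority_correct(p, n) for p in per_question_p) / len(per_question_p)


if __name__ == "__main__":
    # Invented benchmark: the model favors the right answer on 7 of 10 questions
    # (p = 0.7) and favors a wrong answer on the other 3 (p = 0.3).
    benchmark = [0.7] * 7 + [0.3] * 3
    for n in (1, 5, 25, 101):
        print(n, round(ensemble_accuracy(benchmark, n), 3))
```

In this toy setup a single sample scores 0.58 on average, and the vote approaches 0.70 as n grows but never passes it: the three questions where the model leans the wrong way only get more reliably wrong.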
Originally, the term "agency" was used to indicate the inability of computers to act autonomously, intentionally, or with purpose.
Then a few developers published joke projects of automated browsers that pointlessly doomscroll random webpages, saying "here are your agents".
Then a few companies turned this joke into "RAG" agents, but there the automated browsers don't act autonomously; they plagiarize web content according to a prompt.
Then many websites started to block RAG agents, and to hide this, companies fell back to returning data from LLMs, occasionally updating responses in the background (aka "online models").
The idea of plagiarizing content is also mixed with another idea: if an LLM rewrites plagiarized content multiple times, it becomes harder to prove.
Obviously, none of the companies involved will admit to plagiarizing. Instead, they will cover themselves with the idea that it leads to superintelligence. For example, if multiple neural networks repeat that 9.11 > 9.9, it will be considered more accurate[1].
I love this paper though I think their use of "agent" is confusing. My takeaway is ensembling LLM queries is more effective than making a single query. We used a similar approach at work and got a ~30% increase in [performance metric]. It was also cost efficient relative to the ROI increase that came with the performance gains. Simpler systems only require the number of output generations to increase, meaning input costs stay the same regardless of the size of the ensemble.
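A back-of-the-envelope sketch of the cost point, assuming the provider bills the prompt once when several completions are requested in a single call (whether that holds depends on the API you use). The function name, token counts, and prices are all hypothetical.

```python
def ensemble_cost(
    input_tokens: int,
    output_tokens_per_sample: int,
    n_samples: int,
    price_in_per_1k: float,
    price_out_per_1k: float,
    shared_prompt: bool = True,
) -> float:
    """Estimated cost of one ensembled query.

    shared_prompt=True assumes the prompt is billed once for all samples;
    shared_prompt=False models n separate calls that each resend the prompt.
    """
    prompt_billings = 1 if shared_prompt else n_samples
    input_cost = prompt_billings * input_tokens / 1000 * price_in_per_1k
    output_cost = n_samples * output_tokens_per_sample / 1000 * price_out_per_1k
    return input_cost + output_cost


if __name__ == "__main__":
    # Hypothetical numbers: 2000-token prompt, 200-token answers, made-up prices.
    for n in (1, 10):
        shared = ensemble_cost(2000, 200, n, 1.0, 2.0)
        resent = ensemble_cost(2000, 200, n, 1.0, 2.0, shared_prompt=False)
        print(n, shared, resent)
```

With a long prompt and short answers, the shared-prompt case grows much more slowly with ensemble size than resending the prompt each time.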
Thanks for this paper! Still early so not quite production, but I’ve seen positive results on tasks when scaling the number of “agents”. I more or less think of an agent as a task-focused prompt, and these agents are slight variations of the task. It makes me think I’m running some kind of Monte Carlo simulation.
> All it is, is a bunch of LLM calls in a while loop.
Consciousness may be such a loop. Look outward, look at your response, repeat. Like the illusion of motion created by frames per second.
To me it is concerning that proper useful agents may emerge from such a simple loop. E.g., just repeatedly asking a sufficiently intelligent AI, "what action within my power shall I perform next in order to maximize the mass of paperclips in the universe," interleaved with performing that act and recording the result.
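To make "LLM calls in a while loop" concrete, here is a minimal sketch of the ask / act / record loop described above. Everything in it is hypothetical: `scripted_model` stands in for the LLM call, the "stop" convention is my own, and the loop is capped at `max_steps` so it always terminates.

```python
from typing import Callable

History = list[tuple[str, str]]  # (action, result) pairs recorded so far


def run_agent(
    choose_action: Callable[[str, History], str],
    perform: Callable[[str], str],
    goal: str,
    max_steps: int = 10,
) -> History:
    """Ask for the next action, perform it, record the result, repeat."""
    history: History = []
    for _ in range(max_steps):  # bounded stand-in for the "while loop"
        action = choose_action(goal, history)
        if action == "stop":
            break
        history.append((action, perform(action)))
    return history


def scripted_model(goal: str, history: History) -> str:
    """Hypothetical stand-in for the LLM call: replays a fixed plan."""
    plan = ["buy wire", "bend wire", "stop"]
    return plan[min(len(history), len(plan) - 1)]


if __name__ == "__main__":
    steps = run_agent(scripted_model, lambda a: f"did: {a}", "make paperclips")
    for action, result in steps:
        print(action, "->", result)
```

The loop itself is trivial; all of the behavior lives in whatever sits behind `choose_action` and in which actions `perform` is allowed to carry out.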
Honestly agree. When I first started working with agents I didn't fully understand what they really were either, but I eventually settled on a definition: an LLM call that performs a unique function proactively ¯\_(ツ)_/¯.
I spent the last year engineering to the point I could try this and it was ___massively___ disappointing in practice. Massively. Shockingly.
The answers from sampling every big model, then having one distill, were not noticeably better than just from GPT-4o or Claude Sonnet, and the UX is so much worse (2x wait) that I tabled it for now.
I assumed it would be obviously good, even just from first principles.
I didn't do my usual full med/law benchmarks because given what we saw from a small sample, only 8 questions, I can skip adding it and proceed down the TODO list for launch.
I've also done the inverse: I got better results on med with GPT-4o x one round of RAG x one answer than Google's Med-Gemini with its insanely complex "finetune our biggest model on med, then do 2 round RAG with 5 answers + a vote on each round, and an opportunity to fetch new documents in round 2". We're both near saturation, but I got 95% on my random sample of 100 Qs; Med-Gemini was 93%.
I would check out this company, Swarms (https://github.com/kyegomez/swarms), which is working with enterprises to integrate multi-agent systems. But that's definitely a great point to focus on: the research paper mentions that the scaling of performance diminishes with the complexity of the task, which is definitely true for SWE.
I totally believe that people are selling solutions around this idea; what I'd like to hear is genuine success stories from people who have used them in production (and aren't currently employed by a vendor).
Midjourney uses multiple agents for determining if a prompt is appropriate or not.
I kinda did this, too.
I made a 3-agent system: one is a router that parses the request and determines where to send it (to the other 2), one is a chat agent, and the third is an image generator.
If the router determines an image is requested, the chat agent is tasked with making a caption to go along with the image.
I am fairly certain that agents are the same thing as prompts (but also could be different).
Only the chat prompt/agent/whatever is connected to RAG; the image generator is DALLE and the router is a one-off call each time.
E.g., it could be the same model with a different prompt, or a different model + different prompt altogether. AFAIU it’s just serving a different purpose than the other calls.
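For what it's worth, the router / chat / image setup described above can be sketched in a few lines. This is not that commenter's code: `handle`, `keyword_router`, and the lambda stubs are hypothetical, and in the real thing the router would be a one-off LLM call, the chat agent would sit on top of RAG, and the image generator would be DALLE.

```python
from typing import Callable


def handle(
    request: str,
    classify: Callable[[str], str],
    chat: Callable[[str], str],
    generate_image: Callable[[str], str],
) -> dict[str, str]:
    """Route one request: image requests get an image plus a caption
    written by the chat agent; everything else goes to the chat agent."""
    if classify(request) == "image":
        return {
            "image": generate_image(request),
            "caption": chat(f"Write a short caption for: {request}"),
        }
    return {"text": chat(request)}


def keyword_router(request: str) -> str:
    """Hypothetical router stub; a real one would be a one-off LLM call."""
    return "image" if "draw" in request.lower() else "chat"


if __name__ == "__main__":
    chat_stub = lambda prompt: f"[chat reply to: {prompt}]"
    image_stub = lambda prompt: f"[image for: {prompt}]"
    print(handle("Draw a cat in a hat", keyword_router, chat_stub, image_stub))
    print(handle("What is RAG?", keyword_router, chat_stub, image_stub))
```

Each "agent" here is just a callable with its own purpose, which fits the view that it could be the same model with a different prompt or a different model entirely.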
Currently all the tools on the market just use a different prompt and call it an agent.
I've built a tool using different models for each agent, e.g. Whisper for audio decoding, LLaVA to detect slides on the screen, OpenCV to crop the image, OCR to read the slide content, and an LLM to summarize everything that's happening during the earnings call.
Hmmm I get what you mean... I think it's hard to sell a solution around this idea, but I think it will become something more like a common practice/performance improvement method. James Huckle on LinkedIn (https://www.linkedin.com/feed/update/urn:li:activity:7214295...) mentioned that agent communication would be something more like a hyperparameter to tune, which I agree with.