
I don't even buy that the linked paper justifies the claim. All the paper does is draw multiple samples from an LLM and take the majority vote. They do try integrating their majority vote algorithm with an existing multi-agent system, but it usually performs worse than just straightforwardly asking the model multiple times (see Table 3). I don't understand how the author of this article can make that claim, nor why they believe the linked paper supports it. It does, however, support my prior that LLM agents are snake oil.



Looking at Table 3: "Our [sampling and voting] method outperforms other methods used standalone in most cases and always enhances other methods across various tasks and LLMs". Which benchmark did the majority vote algorithm perform worse on?


I'm not saying that sampling and majority voting performed worse. I'm saying that multi-agent interaction (labeled Debate and Reflection) performed worse than straightforward approaches that just query multiple times. For example, the Debate method combined with their voting mechanism gets 0.48 on GSM8K with Llama2-13B. But majority voting with no multi-agent component (Table 2) gets 0.59 in the same setting. And majority voting with chain-of-thought (Table 3 CoT/ZS-CoT) does even better.

Fundamentally, drawing multiple independent samples from an LLM is not "AI Agents". The only rows on those tables that are AI Agents are Debate and Reflection. They provide marginal improvements on only a few task/model combinations, and do so at a hefty increase in computational cost. In many tasks, they are significantly behind simpler and cheaper alternatives.
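
For concreteness, here is a minimal sketch of what "query multiple times and take the majority vote" amounts to. It is not the paper's code: `sample_answer` is a hypothetical stand-in for whatever function calls the model once and returns a final answer string, and answers are compared by exact match, which is an assumption that only suits short answers such as GSM8K numbers.

```python
from collections import Counter
from typing import Callable, Sequence


def majority_vote(answers: Sequence[str]) -> str:
    """Return the most frequent answer; ties go to the answer seen first."""
    return Counter(answers).most_common(1)[0][0]


def sample_and_vote(
    sample_answer: Callable[[str], str],  # hypothetical: one independent model call
    question: str,
    n: int = 10,
) -> str:
    """Draw n independent samples for the same question and vote on the result."""
    answers = [sample_answer(question) for _ in range(n)]
    return majority_vote(answers)
```

Nothing here passes one sample's output into another call, which is the distinction being drawn with Debate and Reflection.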


I see. Agree with the point about marginal improvements at a hefty increase in computational cost (I touch on this a bit at the end of the blog post where I mention that better performance requires better tooling/base models). Though I would still consider sampling and voting a “multi-agent” framework as it’s still performing an aggregation over multiple results.


My take: the agents are bad, the ensembling approach is good.


Ensembling is okay, but it doesn't scale well. You can put in a bit more compute to draw more samples and get a better result, but you can't keep doing that without hitting a plateau, because eventually the model's majority vote stabilizes. So it's a perfectly reasonable thing to do, as long as we temper our expectations.
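
A toy calculation can illustrate the plateau. This is a sketch under simplifying assumptions, not a result from the paper: each question has only two possible answers, samples are independent, and the per-question probabilities in `question_probs` are made up.

```python
from math import comb


def majority_correct_prob(p: float, n: int) -> float:
    """Probability that a strict majority of n independent samples is correct,
    when each sample is correct with probability p (n is assumed odd)."""
    return sum(
        comb(n, k) * p**k * (1 - p) ** (n - k)
        for k in range(n // 2 + 1, n + 1)
    )


# Hypothetical mix of questions: chance that a single sample is correct.
question_probs = [0.9, 0.7, 0.6, 0.4, 0.2]


def expected_accuracy(n: int) -> float:
    """Expected accuracy of n-sample majority voting over the question mix."""
    total = sum(majority_correct_prob(p, n) for p in question_probs)
    return total / len(question_probs)


for n in (1, 5, 25, 101):
    print(n, round(expected_accuracy(n), 4))
```

In this made-up mix, accuracy starts at 0.56 with one sample and climbs toward a ceiling of 0.6, the fraction of questions where the model's most likely answer is the correct one. Past that point extra samples only make the vote more stable on the questions where it is wrong.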


Potentially true as well haha



