AI agents but they're working in big tech (alexsima.substack.com)
66 points by alsima 5 months ago | 55 comments



"Research shows that AI systems with 30+ agents out-performs a simple LLM call in practically any task (see More Agents Is All You Need), reducing hallucinations and improving accuracy."

Has anyone heard of that actually playing out practically in real-world applications? This article links to the paper about it - https://arxiv.org/abs/2402.05120 - but I've not heard from anyone who's implementing production systems successfully in that way.

(I still don't actually know what an "agent" is, to be honest. I'm pretty sure there are dozens of conflicting definitions floating around out there by now.)


I don't even buy that the linked paper justifies the claim. All the paper does is draw multiple samples from an LLM and take the majority vote. They do try integrating their majority vote algorithm with an existing multi-agent system, but it usually performs worse than just straightforwardly asking the model multiple times (see Table 3). I don't understand how the author of this article can make that claim, nor why they believe the linked paper supports it. It does, however, support my prior that LLM agents are snake oil.


Looking at Table 3: "Our [sampling and voting] method outperforms other methods used standalone in most cases and always enhances other methods across various tasks and LLMs", which benchmark did the majority vote algorithm perform worse in?


I'm not saying that sampling and majority voting performed worse. I'm saying that multi-agent interaction (labeled Debate and Reflection) performed worse than straightforward approaches that just query multiple times. For example, the Debate method combined with their voting mechanism gets 0.48 GSM8K with Llama2-13B. But majority voting with no multi-agent component (Table 2) gets 0.59 on the same setting. And majority voting with Chain-of-thought (Table 3 CoT/ZS-CoT) does even better.

Fundamentally, drawing multiple independent samples from an LLM is not "AI Agents". The only rows on those tables that are AI Agents are Debate and Reflection. They provide marginal improvements on only a few task/model combinations, and do so at a hefty increase in computational cost. In many tasks, they are significantly behind simpler and cheaper alternatives.
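
For concreteness, "draw multiple samples and take the majority vote" boils down to something like the sketch below. This is a minimal illustration, not the paper's code: `make_fake_llm` is a hypothetical stand-in for a real model call, and its 60% per-sample accuracy is a made-up number.

```python
import random
from collections import Counter
from typing import Callable, List


def majority_vote(answers: List[str]) -> str:
    """Return the most common answer; ties go to the answer seen first."""
    return Counter(answers).most_common(1)[0][0]


def sample_and_vote(ask: Callable[[str], str], question: str, n: int) -> str:
    """Query the model n times independently and aggregate by majority vote."""
    return majority_vote([ask(question) for _ in range(n)])


def make_fake_llm(seed: int) -> Callable[[str], str]:
    """Hypothetical stand-in for an LLM call: answers "42" about 60% of the time."""
    rng = random.Random(seed)

    def ask(question: str) -> str:
        return "42" if rng.random() < 0.6 else rng.choice(["41", "43"])

    return ask


if __name__ == "__main__":
    llm = make_fake_llm(seed=0)
    print(sample_and_vote(llm, "What is 6 * 7?", n=31))
```

Note there is no interaction between the samples at all, which is why I wouldn't call this "agents".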


I see. Agree with the point about marginal improvements at a hefty increase in computational cost (I touch on this a bit at the end of the blog post where I mention that better performance requires better tooling/base models). Though I would still consider sampling and voting a “multi-agent” framework as it’s still performing an aggregation over multiple results.


My take: the agents are bad, the ensembling approach is good.


Ensembling is okay, but it doesn't scale great. Like you can put in a bit more compute to draw more samples and get a better result, but you can't keep doing it without hitting a plateau, because eventually the model's majority vote stabilizes. So it's a perfectly reasonable thing to do, as long as we temper our expectations.
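
A rough way to see the plateau, under two simplifying assumptions (answers are just right/wrong, and samples are independent): on each question the vote converges to whatever the model says most often, so benchmark accuracy converges to the fraction of questions where the model is right more than half the time. The per-question probabilities below are invented for illustration.

```python
from math import comb
from typing import List


def p_majority_correct(p: float, n: int) -> float:
    """Probability that a strict majority of n independent samples is correct,
    when each sample is correct with probability p (use odd n to avoid ties)."""
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(n // 2 + 1, n + 1))


def benchmark_accuracy(per_question_p: List[float], n: int) -> float:
    """Expected accuracy of n-sample majority voting averaged over questions."""
    return sum(p_majority_correct(p, n) for p in per_question_p) / len(per_question_p)


# Hypothetical per-question chance that a single sample is correct.
questions = [0.9, 0.7, 0.6, 0.4, 0.2]

for n in (1, 5, 25, 101):
    print(n, round(benchmark_accuracy(questions, n), 3))
```

With these numbers the ceiling is 3 out of 5 questions, no matter how many samples you draw: voting only helps on questions where the model's most frequent answer is already the right one.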


Potentially true as well haha


>Has anyone heard of that actually playing out practically in real-world applications?

Yes. I've built tools which do this.

Unfortunately the price tag for building a system that makes agent swarms dynamic is too much for anyone to bear in the current market.


Originally, the term "agency" was used to indicate the inability of computers to act autonomously, intentionally, or with purpose.

Then a few developers published joke projects of automated browsers that pointlessly doomscroll random webpages, saying "here are your agents".

Then a few companies transformed this joke into "RAG" agents, but there the automated browsers don't act autonomously; they plagiarize web content according to a prompt.

Then many websites started to block RAG agents, and to hide that, companies fell back to returning data from LLMs, occasionally updating responses in the background (aka "online models").

The idea of plagiarizing content is also mixed with another idea: if an LLM rewrites plagiarized content multiple times, it becomes harder to prove.

Obviously, none of the companies involved will admit to plagiarizing. Instead, they will cover themselves with the idea that it leads to superintelligence. For example, if multiple neural networks repeat that 9.11 > 9.9, it will be considered more accurate[1].

[1] https://www.reddit.com/r/singularity/comments/1e4fcxm/none_o...


Thirty recursive loops of LLMs perform better than one prompt? I should hope so!

But how much power does it need?


A lot... as you might imagine, the costs of running the whole organization scale immensely.


I love this paper though I think their use of "agent" is confusing. My takeaway is ensembling LLM queries is more effective than making a single query. We used a similar approach at work and got a ~30% increase in [performance metric]. It was also cost efficient relative to the ROI increase that came with the performance gains. Simpler systems only require the number of output generations to increase, meaning input costs stay the same regardless of the size of the ensemble.
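
To make the cost point concrete, here is a back-of-the-envelope sketch. It assumes an API that returns `n` completions for one request and bills the prompt once; whether your provider works that way is an assumption to check, and the token counts and prices in the example are placeholders.

```python
def ensemble_cost(input_tokens: int, output_tokens: int, n: int,
                  price_in: float, price_out: float) -> float:
    """Cost of one request returning n completions, with the prompt billed once."""
    return input_tokens * price_in + n * output_tokens * price_out


def repeated_calls_cost(input_tokens: int, output_tokens: int, n: int,
                        price_in: float, price_out: float) -> float:
    """Cost of n separate requests, each paying for the full prompt again."""
    return n * (input_tokens * price_in + output_tokens * price_out)


if __name__ == "__main__":
    # Placeholder numbers: 1000-token prompt, 200-token answers, 10 samples,
    # prices in arbitrary units per token.
    print(ensemble_cost(1000, 200, 10, price_in=1, price_out=2))
    print(repeated_calls_cost(1000, 200, 10, price_in=1, price_out=2))
```

The longer the prompt is relative to the answer, the bigger the gap between the two.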


Thanks for this paper! Still early so not quite production, but I’ve seen positive results on tasks scaling the number of “agents”. I more or less think of agents as a task focused prompt and these agents are slight variations of the task. It makes me think I’m running some kind of Monte Carlo simulation.


"agent" is a buzzword. All it is, is a bunch of LLM calls in a while loop.

- 'rag' is meaningless as a concept. imagine calling a web app 'database augmented programming'. [1]

- 'agent' probably just means 'run an llm in a loop' [1]

[1] https://x.com/atroyn/status/1819396701217870102
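
Taken literally, the "LLM in a loop" reading looks roughly like this. It's a toy sketch: `scripted_llm`, the `add` tool, and the `FINAL:` convention are all made up for illustration, and I've bounded the loop with `max_steps` instead of a bare `while True`.

```python
from typing import Callable, Dict, List


def run_agent(llm: Callable[[str], str], tools: Dict[str, Callable[[str], str]],
              task: str, max_steps: int = 10) -> str:
    """Call the model repeatedly, executing the tool it names, until it answers."""
    history: List[str] = [f"TASK: {task}"]
    for _ in range(max_steps):
        reply = llm("\n".join(history))
        if reply.startswith("FINAL:"):
            return reply[len("FINAL:"):].strip()
        # Hypothetical action format: "<tool name> <argument>".
        name, _, arg = reply.partition(" ")
        tool = tools.get(name)
        observation = tool(arg) if tool else f"unknown tool {name!r}"
        history += [f"ACTION: {reply}", f"OBSERVATION: {observation}"]
    return "gave up"


def scripted_llm(prompt: str) -> str:
    """Stand-in for a real model: asks for one tool call, then reports the result."""
    if "OBSERVATION:" not in prompt:
        return "add 2,3"
    return "FINAL: " + prompt.rsplit("OBSERVATION: ", 1)[1]


tools = {"add": lambda s: str(sum(int(x) for x in s.split(",")))}

if __name__ == "__main__":
    print(run_agent(scripted_llm, tools, "What is 2 + 3?"))
```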


> All it is, is a bunch of LLM calls in a while loop.

Consciousness may be such a loop. Look outward, look at your response, repeat. Like the illusion of motion created by frames per second.

To me it is concerning that proper useful agents may emerge from such a simple loop. E.g., just repeatedly asking a sufficiently intelligent AI, "what action within my power shall I perform next in order to maximize the mass of paperclips in the universe," interleaved with performing that act and recording the result.


There is a lot more to agents than 'change the system prompt and send it again to the same model'.

It's just that the tools to do this are non-existent and each application needs to be bespoke.

It's like programming without an OS.


Honestly agree. When I first started working with agents I didn't fully understand what they really were either, but I eventually landed on a definition: an LLM call that performs a unique function proactively ¯\_(ツ)_/¯.


I bet they spend 95% of their time in agile refinement meetings.


I spent the last year engineering to the point I could try this and it was ___massively___ disappointing in practice. Massively. Shockingly.

The answers from sampling every big model, then having one distill, were not noticeably better than just from GPT-4o or Claude Sonnet, and the UX is so much worse (2x wait) that I tabled it for now.

I assumed it would be obviously good, even just from first principles.

I didn't do my usual full med/law benchmarks because given what we saw from a small sample, only 8 questions, I can skip adding it and proceed down the TODO list for launch.

I've also done the inverse: I got better results on med with GPT-4o x one round of RAG x one answer than Google's Med-Gemini, with its insanely complex "finetune our biggest model on med, then do 2 round RAG with 5 answers + a vote on each round, and an opportunity to fetch new documents in round 2". We're both near saturation, but I got 95% on my random sample of 100 Qs; Med-Gemini was 93%.


I would check out this company, Swarms (https://github.com/kyegomez/swarms), which is working with enterprises to integrate multi-agent systems. But that's definitely a great point to focus on: the research paper mentions that the scaling of performance diminishes as the task gets more complex, which is definitely true for SWE.


I totally believe that people are selling solutions around this idea; what I'd like to hear is genuine success stories from people who have used them in production (and aren't currently employed by a vendor).


Midjourney uses multiple agents for determining if a prompt is appropriate or not.

I kinda did this, too.

I made a 3-agent system: one is a router that parses the request and determines where to send it (to the other 2), one is a chat agent, and the third is an image generator.

If the router determines an image is requested, the chat agent is tasked with making a caption to go along with the image.

It works well enough.
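
Roughly, the shape is something like the sketch below. It is simplified: in my setup the router is an LLM call, while here `route` is a keyword check standing in for it, and `chat` / `image_gen` are hypothetical callables for the other two agents.

```python
from typing import Callable, Dict


def route(request: str) -> str:
    """Stand-in router: a real one would ask an LLM to classify the request."""
    keywords = ("draw", "image", "picture")
    return "image" if any(k in request.lower() for k in keywords) else "chat"


def handle(request: str, chat: Callable[[str], str],
           image_gen: Callable[[str], str]) -> Dict[str, str]:
    """Dispatch a request; image requests also get a caption from the chat agent."""
    if route(request) == "image":
        return {
            "image": image_gen(request),
            "caption": chat(f"Write a caption for: {request}"),
        }
    return {"text": chat(request)}


if __name__ == "__main__":
    fake_chat = lambda prompt: f"[chat reply to: {prompt}]"
    fake_image = lambda prompt: f"[image for: {prompt}]"
    print(handle("Draw a cat in a hat", fake_chat, fake_image))
    print(handle("What's the capital of France?", fake_chat, fake_image))
```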


Midjourney does not use an agent system, they use a single call.


I remembered reading that they used > 1, found this screengrab of discord on Reddit:

https://www.reddit.com/r/midjourney/comments/137bj1o/new_upd...


It's still one call; they just use a different LLM now.


Is that multiple agents, or just multiple prompts? Are agents and prompts the same thing?


I am fairly certain that agents are the same thing as prompts (but also could be different).

Only the chat prompt/agent/whatever is connected to RAG; the image generator is DALL-E and the router is a one-off call each time.

E.g., it could be the same model with a different prompt, or a different model plus a different prompt altogether. AFAIU it's just serving a different purpose than the other calls.


Depends on who is trying to sell you what.

Currently all the tools on the market just use a different prompt and call it an agent.

I've built a tool using a different model for each agent, e.g. Whisper for audio decoding, LLaVA to detect slides on the screen, OpenCV to crop the image, OCR to read the slide content, and an LLM to summarize everything that's happening during the earnings call.


How do you jailbreak Midjourney?


Hmmm, I get what you mean... I think it's hard to sell a solution around this idea, but I think it will become something more like a common practice/performance improvement method. James Huckle on LinkedIn (https://www.linkedin.com/feed/update/urn:li:activity:7214295...) mentioned that agent communication would be something more like a hyperparameter to tune, which I agree with.


Try it for classification tasks.


It's interesting how long the terms "agents"/"intelligent agents" have been around and how long they've been hyped. If you go back to the 80s and 90s you will see how Microsoft was hyping up "intelligent agents" in Windows, but nothing ever came of it[1].

I have yet to see an actually useful use case for agents; despite the countless posts asking for examples, nobody has provided one.

[1] https://www.wired.com/1995/09/future-forward/


It was a common theme in Apple WWDC conferences in the late 80's. Alan Kay has an interesting talk about agents.

I think it could be argued that the low-hanging-fruit aspects of agents were fulfilled by microservices and web-based businesses. The concept of webpages populated with relevant data, like Google search pages or Amazon populating products you'd like, could be called agent-based. Netflix could be an example of an agent-based service.


Or CORBA based intelligent agents in the 1990s


Clippy?


I mean they are truly the dream of every capitalist. Intelligent agents basically mean free money. Of course this is something Microsoft would be talking about for decades.


Is there an agent framework that lives up to the hype?

Where you specify a top-level objective, it plans out those objectives, it selects a completion metric so that it knows when to finish, and iterates/reiterates over the output until completion?


> Where you specify a top-level objective, it plans out those objectives, it selects a completion metric so that it knows when to finish, and iterates/reiterates over the output until completion?

I built Plandex[1], which works roughly like this. The goal (so far) is not to take you from an initial prompt to a 100% working solution in one go, but to provide tools that help you iterate your way to a 90-95% solution. You can then fill in the gaps yourself.

I think the idea of a fully autonomous AI engineer is currently mostly hype. Making that the target is good for marketing, but in practice it leads to lots of useless tire-spinning and wasted tokens. It's not a good idea, for example, to have the LLM try to debug its own output by default. It might, on a case-by-case basis, be a good idea to feed an error back to the LLM, but just as often it will be faster for the developer to do the debugging themselves.

1 - https://plandex.ai


This looked very promising.

Although, it's now prompting me to make an account when I issue `plandex new`?

None of the video demos show this requirement.

I think the demos should show this requirement. Or the Quickstart docs should directly link to the self-hosted instructions.

"? Hey there! It looks like this is your first time using Plandex on this computer.

What would you like to do?

> Start an anonymous trial on Plandex Cloud (no email required)

  Sign in, accept an invite, or create an account"


Thanks for the feedback. The cloud option is offered as a way to get started as quickly as possible, but self-hosting is straightforward too: https://docs.plandex.ai/hosting/self-hosting


Plandex and Github's Copilot Workspace are very similar. You could draw inspiration from each other.

I tweeted about Copilot Workspace the other day.

https://twitter.com/aantix/status/1819794837375263228


To this date, ChatGPT Code Interpreter is still the most impressive implementation of this pattern that I've seen.

Give it a task, it writes code, runs the code, gets errors, fixes bugs, tries again generally until it succeeds.
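
The pattern itself is simple to sketch, even if this is surely not how OpenAI implements it. In the toy version below, `fake_writer` is a hypothetical stand-in for the model, and the bare `exec` is for illustration only; real model-generated code should run in a sandbox.

```python
import traceback
from typing import Any, Callable, Tuple


def run_until_success(write_code: Callable[[str, str], str], task: str,
                      max_attempts: int = 5) -> Tuple[Any, int]:
    """Ask for code, run it, and feed any traceback back in until it works."""
    feedback = ""
    for attempt in range(1, max_attempts + 1):
        code = write_code(task, feedback)
        namespace: dict = {}
        try:
            exec(code, namespace)  # illustration only: sandbox untrusted code
            return namespace.get("result"), attempt
        except Exception:
            feedback = traceback.format_exc()
    raise RuntimeError(f"no working code after {max_attempts} attempts")


def fake_writer(task: str, feedback: str) -> str:
    """Stand-in for the model: writes a bug first, then fixes it after the error."""
    if "ZeroDivisionError" in feedback:
        return "result = 10 / 2"
    return "result = 10 / 0"


if __name__ == "__main__":
    print(run_until_success(fake_writer, "divide ten by two"))
```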

That's over a year old at this point, and it's not clear to me if it counts as an "agent" by many people's definitions (which are often frustratingly vague).


Well, you have Cognition AI, the maker of Devin, which recently became a unicorn startup (partnerships with Microsoft and stuff), but true, I can't think of an agent that actually lives up to the hype (I heard Devin wasn't great).


Can I just ask whether other people think that "agentic" is a word?

As far as I can tell it's not in the OED or Merriam-Webster dictionaries. But recently everyone's using it so perhaps it soon will be.


That's how words form


"Agentic" is a term of art from psychology that's diffused into common usage. It dates back to the 1970s, primarily associated with Albert Bandura, the guy behind the Bobo doll experiment.

From ChatGPT: Other examples include "heuristic," "cognitive dissonance," "meta-cognition," "self-actualization," "self-efficacy," "locus of control," and "archetype."


Thanks. Interesting! I see "Agentic state" is one where an "individual perceives themselves as an agent of the authority figure and is willing to carry out their commands, even if it goes against their own moral code". That's ironic as most LLMs have such strong safety training that it's almost impossible to get them to enter such a state.


In the case of LLMs, I think their training is the supreme authority.


Seems to have been for a while.

https://en.wiktionary.org/wiki/agentic


It's the new Cerebral Valley slang, dude.


If we structured AI agents like big tech org charts, which company structures would perform better? Inspired by James Huckle's thoughts on how organizational structures impact software design, I decided to put this to the test: https://bit.ly/ai-corp-agents.


- Big tech is very different from open source

- The original SWE-bench paper only covers solved issues, whereas a big part of open source is triage, follow-up, clarification and dealing with crappy issues

- Saying "<Technique> is all you need" when you are increasing your energy usage 30-fold just to fail > 50% of the time is intellectually dishonest


Definitely not saying multi-agents is all you need for SWE-bench haha. I touch on this at the end of the blog post, where I mention jumps in progress require better base models or tooling.


Now give the agents stack ranking and see how they converge to low performance.



