
I’ve been running a bunch of coding agents on benchmarks recently as part of consulting, and this is actually much more impressive than it seems at first glance.

71.2% puts it at 5th, which is 4 points below the leader (four points is a lot) and just over 1% lower than Anthropic’s own submission for Claude Sonnet 4 - the same model these guys are running.

But the top rated submissions aren’t running production products. They generally have extensive scaffolding or harnesses that were built *specifically for SWE bench*, which kind of defeats the whole purpose of the benchmark.

Take Refact, for example, which is at #2 with 74.4%: they built a ~2k-line framework around their agent specifically for SWE-bench (https://github.com/smallcloudai/refact-bench/). It’s pretty elaborate, orchestrating multiple agents, with a debug agent that kicks in if the main agent fails. The debug agent analyzes the failure and gives insights to the main agent, which tries again, so it’s effectively multiple attempts per problem.

If the results can be reproduced “out-of-the-box” with their coding agent like they claim, it puts it up there as one of the top 2-3 CLI agents available right now.
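
Roughly, the debug-agent orchestration described for Refact above looks something like this (a minimal sketch, not their actual code; main_agent, debug_agent and the success flag are stand-ins):

    # Hypothetical sketch of a debug-agent retry loop, not Refact's real harness.
    def solve_with_debug(problem, main_agent, debug_agent, max_rounds=3):
        hints = None
        for _ in range(max_rounds):
            result = main_agent(problem, hints=hints)
            if result.success:
                return result
            # The debug agent inspects the failure and feeds insights back,
            # so each round is effectively another attempt at the problem.
            hints = debug_agent(problem, result)
        return result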


This is classic Goodhart's law. "When a measure becomes a target, it ceases to be a good measure"

https://en.wikipedia.org/wiki/Goodhart%27s_law


It's really not that hard to just use your product straight out of the box instead of building a custom bench setup to game the benchmark, though.


Right, other than financial pressure. Which is, of course, immense.


Right. Building a custom setup is blatant; that will wildly overfit.

But let's say a group uses it as a metric as part of CI and each new idea / feature they create runs against SWE-bench. Maybe they have parameterized bits and pieces they adjust, maybe they have multiple candidate datasets for fine-tuning, maybe they're choosing between checkpoints.

This will also end up overfitting - especially if done habitually. It might be a great metric and result in a more powerful overall model. Or it might not.


Also see the VW dieselgate and numerous other "gaming the system" examples.


A specific setup for the benchmark is just plain cheating, not Goodhart’s law.


What are the typical context lengths in SWE-bench problems? Does it partly measure performance in the 64-128k context range?


This is what the rows look like:

https://huggingface.co/datasets/princeton-nlp/SWE-bench_Veri...

It's up to your retrieval system/model to selectively hunt for relevant context. Here are a few critiques of the benchmark:

https://x.com/brhydon/status/1953648884309536958


IIRC the SWE-bench dataset gives you the full repo snapshot + the issue text; the evaluation pipelines typically run some kind of retriever (e.g. grep, BM25) to pick a subset of files to place in the model’s context. The provided context is usually limited to ~50k tokens.
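
For illustration, a BM25-based file retriever in that spirit might look roughly like this (a sketch assuming the rank_bm25 package; the file filter, ranking and token budget are my own guesses, not any particular harness):

    # Illustrative BM25 retrieval over a repo snapshot; not from any real pipeline.
    from pathlib import Path
    from rank_bm25 import BM25Okapi

    def retrieve_files(repo_root, issue_text, top_k=20):
        paths = list(Path(repo_root).rglob("*.py"))
        corpus = [p.read_text(errors="ignore").split() for p in paths]
        bm25 = BM25Okapi(corpus)
        scores = bm25.get_scores(issue_text.split())
        ranked = sorted(zip(scores, paths), key=lambda sp: -sp[0])
        # The caller would then pack the top-ranked files into the prompt
        # until it hits the ~50k-token budget mentioned above.
        return [p for _, p in ranked[:top_k]]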


Is there something in this multi-agent approach that makes the setup more specific to just the test at hand and less general to real engineering tasks? If not, then this multi-agent system will just become what you get out of the box in a future product. Multiple attempts per problem (as long as there's no human intervention or selection between them) is a perfectly fine approach for agents because that's not an issue from the perspective of an engineer using the product. A single agent is already a multi-step usage of LLMs and it sounds like this is just another meta level of that.



I think multiple attempts are completely understandable and even expected? How is that defeating the purpose of the benchmark?


It’s a pass@1 benchmark. When submitting you need to check a box that there was only 1 attempt per problem. See here for example: https://github.com/SWE-bench/experiments/pull/219

Building multiple attempts into your agent is stretching the rules, even if technically it’s acceptable


From my perspective as a potential user the number of attempts is the number of times I have to tell it what to do. If you have an agent that makes a single attempt and is 60% accurate vs another that makes 5 attempts and is 80% accurate, why would you care that each individual attempt of the 2nd model is less accurate than the first?
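
To put rough numbers on it: if attempts really were independent and the agent could reliably tell when it succeeded (two big assumptions that clearly don't hold in practice), retries would compound quickly:

    # Toy pass@k arithmetic; assumes independent attempts and a perfect
    # success check, neither of which holds in practice.
    def pass_at_k(p, k):
        return 1 - (1 - p) ** k

    print(pass_at_k(0.60, 1))  # 0.60
    print(pass_at_k(0.60, 5))  # ~0.99 under these (unrealistic) assumptions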


I think it hinges on "But the top rated submissions aren’t running production products." It sounds like they're shipping a product without the debug-agent/try-again logic, which exists just for the benchmark, so as a user you wouldn't get the performance they report.


I was thinking about this recently with respect to how many agent systems now let you specify a smaller/faster model for easier tasks and a bigger model for harder tasks.

It's interesting to think about what the trade-offs are. Assuming the system can properly classify a task as easy or hard (big "if" but I guess there are ways), there is nonetheless more to think about, depending on your pricing plan.

For subscription pricing, I guess you don't really care which model runs and in fact it's hard to find a reason to ever run the smaller model, so choosing between the models is more in the provider's interests for cost efficiency.

But for pay-per-use pricing: if you have a bigger model that gets the answer right 80% of the time, and a smaller model that can handle smaller changes, gets things right 60% of the time, but can correct its mistakes, then the system should try to run the smaller one on as many tasks as possible to save you money. If it ends up having to make a lot of corrections, though, you may need more total requests than with the larger model, in which case the larger model might actually be cheaper because it takes fewer requests.

So I wonder how that kind of trade-off could be effectively calculated. I guess if you can figure out when "retries" happen you can count them and do some statistics on which model is more likely to work out in fewer shots. It's pretty complicated though, when you start to think about it in detail.
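
As a back-of-the-envelope way to frame it (all prices and success rates below are made up, and I'm assuming each retry is an independent attempt):

    # Toy expected-cost comparison; numbers are invented for illustration.
    def expected_cost(cost_per_request, success_rate, max_retries=5):
        total, p_still_failing = 0.0, 1.0
        for _ in range(max_retries):
            total += p_still_failing * cost_per_request  # this attempt happens
            p_still_failing *= (1 - success_rate)        # chance we need another
        return total

    print(expected_cost(cost_per_request=0.01, success_rate=0.6))  # smaller model
    print(expected_cost(cost_per_request=0.05, success_rate=0.8))  # bigger model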

I do wonder if even having BOTH the smaller and bigger model make hypotheses, and try the smaller model's idea first, then if it fails, try the bigger model's idea, might be the way to go.


This ok from your perspective then?

    # "pass@1" wrapper: retries internally but reports a single attempt
    def make_pass_at_1_agent(agent, n):
        def retry_agent(problem):
            result = None
            for attempt in range(n):
                result = agent(problem)
                if result.success:
                    return result
            return result
        return retry_agent


Definitely wouldn't have written the code that way, but yes, if (and this is a massive "if") the agent has an accurate and meaningful way to determine which way to set the success boolean. The obvious caveat would be if n needed to be large enough to set the costs higher than I am willing to pay for the additional performance or it makes it take longer than I'm willing to wait.

Think of the agent like an employee. If he delivers the code within the expected time and to the expected quality standards, his process of getting there means almost nothing. Do I care if he tried 4 different approaches along the way and threw out the first 3? Not a bit.


Absolutely fine, as long as the success flag is predicted by the model ensemble under test. That’s how Claude Code works for example, it will continue to iterate until success (or it will give up with failure at a certain point).


Keep in mind that this isn’t about users - the top agents on the leaderboard aren’t running an actual product on the benchmark.

If they are running their production product as is, then of course whatever is built into the product is fine.


Papers have been doing rollouts that involve a model proposing N solutions and then self-reviewing to choose the best one (prior to the verifier). So far, I think that's been counted as one pass.
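
i.e. something like this (a sketch of the pattern; propose() and self_review() are hypothetical model calls, not an API from any of those papers):

    # Sketch of propose-N-then-self-review; all functions are hypothetical.
    def best_of_n(problem, propose, self_review, n=8):
        candidates = [propose(problem) for _ in range(n)]
        # The model picks among its own candidates before the external
        # verifier ever runs, and the whole thing counts as one pass.
        return self_review(problem, candidates)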


In your experience with this model, is it just trained for the benchmark, or do the scores actually reflect its performance?


Finally someone mentions Refact, I was in contact with the team, rooting for them really.


Just looked them up. Their pricing is based on buying "coins" with no transparency as to what that gets you. Hard pass


You realize that you can self-host their stuff? https://github.com/smallcloudai/refact


One thing with SWE bench is making sure there's zero leakage of information into the LLM context.

I.e. the agent cannot even know which tests are failing.

It has to both fix the issue based just on the issue text and fix it in the specific way the unit test, which it cannot see, expects.

For this reason I find the benchmark a little disconnected from the reality of software engineering.


I see it a bit differently - LLMs are an incredible innovation but it’s hard to do anything useful with them without the right wrapper.

A good wrapper has deep domain knowledge baked into it, combined with automation and expert use of the LLM.

It maybe isn’t super innovative but it’s a bit of an art form and unlocks the utility of the underlying LLM


Exactly.

To present a potential use case: there's a ridiculous and massive backlog in the Indian judicial system. LLMs can be let loose on the entire workflow: triage cases (simple, complicated, intractable, grouped by legal principles or parties), pull up related caselaw, provide recommendations, throw more LLMs and more reasoning at unclear problems. Now you can't do this with just a desktop and ChatGPT; you need a systemic pipeline of LLM-driven workflows, but doing that unlocks potentially billions of dollars of value that is otherwise elusive.


>doing that unlocks potentially billions of dollars of value that is otherwise elusive

What's more, it unlocks potentially new additions to the 206 legal cases where generative AI produced hallucinated (fake) content.

https://www.damiencharlotin.com/hallucinations/


>pull up related caselaw

Or just make some up...


At the token layer an LLM can make things up, but not as part of a structured pipeline that enforces the invariant that every suggestion is a valid entity in the database.

Can google search hallucinate webpages?
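
Concretely, the kind of invariant check meant above could look like this (a sketch; the citation format, database and lookup are all invented):

    # Sketch of validating LLM-suggested citations against a case-law DB.
    # caselaw_db is a hypothetical mapping from citation id -> case record.
    def validate_citations(suggested_citations, caselaw_db):
        valid, hallucinated = [], []
        for citation in suggested_citations:
            record = caselaw_db.get(citation)
            (valid if record is not None else hallucinated).append(citation)
        # Hallucinated citations never reach the user; they can be sent
        # back to the model for correction or simply dropped.
        return valid, hallucinated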


How is something that can't admit it doesn't know, and hallucinates, a good innovation?


Modern LLMs frequently do state that they "don't know", for what it's worth. Like everything, it highly depends on the question.


It will catch those sneaky bugs


Stating that Israel doesn’t have a right to exist has been recognized to be an antisemitic statement by many prominent institutions.

It’s a radical statement that effectively denies the rights of millions of people to exist and is especially problematic given the historical context of the establishment of Israel.

The statement gets thrown around so much in certain circles that it’s gotten normalized. You’ve apparently lost sight of, or never stopped to think about, what it actually means, to the point where you’re providing it as an example of an innocent statement that got you banned for no reason. Taking this statement out of radical activist circles and into the real world won’t go well.

Take some time to educate yourself and reflect on what it actually means.


But states don't have a right to exist. It's not a right a state can have.

The context of the establishment of Israel is also the mass expulsion of Palestinians, terrorism, the murder of British soldiers trying to keep peace, biological warfare with the poisoning of wells with typhus, etc.


Why is it ok to say Palestine doesn’t have a right to exist, but not to say Israel doesn’t have a right?


[flagged]


What does that mean, 'a right to exist'? It sounds like emotional nonsense speak - we have the powerful words right and exist, and together they sound existential, for a person.

In international law, afaict, existing states have a right to exist - yes, almost circular. It's not an endorsement of them as good or ok; it's recognizing reality, and it's considered absolutely essential to maintaining peace - otherwise everyone could attack almost anyone else, because if you're going to start delegitimizing states based on their bad actions, including in their formation, there's going to be a long list.

But beyond that, I don't even know what it means for state. Beyond any doubt, the humans in Israel and the occupied territories all have a right to exist.


After the fall of the Nazi regime, two new German governments were installed, and then unification happened. Yugoslavia broke up. The Soviet Union broke up. States alter, abolish, and replace themselves all the time.

Maybe the dissolution I'm hoping for will actually take the form of Israel codifying a constitution finally that grants equal rights to Arab Palestinians and Jewish Israelis.

>Beyond any doubt, the humans in Israel and the occupied territories all have a right to exist.

I certainly agree here.


> After the fall of the Nazi regime, two new German governments were installed, and then unification happened. Yugoslavia broke up. The Soviet Union broke up. States alter, abolish, and replace themselves all the time.

All the time? You had to go back to the 1940s and 1990s to find examples. Per Wikipedia, though I wouldn't trust it completely, there have been only three new countries since 1994 - South Sudan, Kosovo, Montenegro:

https://en.wikipedia.org/wiki/List_of_sovereign_states_by_da...

(skip down to Sortable list and sort by "Acquisition of sovereignty")


Yes, it uses Supabase, doesn’t roll its own


I just tried the demo on the homepage and I don’t know what kind of sorcery this is but it’s blowing my mind.

I input a bunch of completely made up words (Quastral Syncing, Zarnix Meshing, HIBAX, Bilxer) and used them in a sentence and the model zero-shotted perfect speech recognition!

It’s so counterintuitive for me that this would work. I would have bet that you have to provide at least one audio sample in order for the model to recognize a word it was never trained on.

Providing a word to the model in the text modality and having it recognize that word in the audio modality must be an emergent property.


They’re all in. They announced they’ll add support for it in the desktop app and the API in the coming months: https://x.com/OpenAIDevs/status/1904957755829481737


I'm surprised this was announced in a random tweet instead of a blog post with a release roadmap or something like that.


because it's a lil embarrassing OAI didn't come up with it


Currently supported in the Agents SDK https://openai.github.io/openai-agents-python/mcp/


we started using it recently at my work. the code changes walkthrough is nice


I think the same can be said about AI-assisted writing…

I like the ideas presented in the post but it’s too long and highly repetitive.

AI will happily expand a few information dense bullet points into a lengthy essay. But the real work of a strong writer is distilling complex ideas into few words.

