I recently had to do a one-off task using SQL in a way that I wasn't too familiar with. Since I could explain conceptually what I needed but didn't know all the right syntax this seemed like a perfect use case to loop in Claude.
The first couple back and forths went ok but it quickly gave me some SQL that was invalid. I sent back the exact error and line number and it responded by changing all of the aliases but repeated the same logical error. I tried again and this time it rewrote more of the code, but still used the exact same invalid operation.
At that point I just went ahead and read some docs and other resources and solved things the traditional way.
Given all of the hype around LLMs I'm honestly surprised to see top models still failing in such basic and straightforward ways. I keep trying to use LLMs in my regular work so that I'm not missing out on something potentially great but I still haven't hit a point where they're all that useful.
I was never on board with it. It feels like the same step change Google was - there was a time, around 1998, when it was just miles ahead of everything else out there. The first time you used it, it was like "geez, you got it right, didn't know that was possible". It's big, it changed things, but it wasn't the end-of-history event a bunch of people are utterly convinced this is.
Depends what you mean by evolve. I don't think we'll get general AI by simply scaling LLMs, but I think general AI, if it arrives, will be able to trace its lineage very much back to LLMs. Journeys through the embedding space very much feel like the way forward to me, and that's what LLMs are.
Embedding spaces are one thing, LLMs are quite another.
I believe the former are understandable and likely a part of true AGI, but the latter are a series of hacks, at worst a red herring leading us off the proper track into a dead end.
It may or may not happen but “scam” means intentional deceit. I don’t think anyone actually knows where LLMs are going with enough certainty to use that pejorative.
Yes. I'm pretty sure any engineer working on this knows it's not "a few years away". But it doesn't stop product teams from taking advantage of the hype cycle. Hence: "use deception to deprive (someone) of money or possessions."
I'm equally sure there are true believers who've drunk the Kool-aid and really believe AGI is right around the corner, just need to fix a few bugs and wait just a few more generations of Moore's law. What difference does the beliefs of a nameless engineer at an AI company make?
>What difference does the beliefs of a nameless engineer at an AI company make?
Hopefully a manager and proper task scheduling. If I made these promises every sprint and kept saying "yea the task is only a week away from completion!" I'd be fired unless I fell down the rabbit hole to Alice in Wonderland. I'm using good faith to assume a lot of those AI engineers are smarter and better schedulers than I am.
But that's what managers and proper scoping and perspective are for. Maybe they're okay with that, but I'd wager any profit-motivated company would not keep exploring unless the gains are enormous.
but we're laymen with no attempts to define what AGI means in concrete terms, never mind remotely come to any agreement over it, so the manager and proper task scheduling is gonna have much more concrete and discrete tasks in the sprint in Jira. It's not like there's just one ticket with a billion points or umptuple XL that says AGI. That's not even gonna be an epic with a million tickets. (or maybe it is. I don't work there, lol)
The people in the trenches there are working on tactical specific things and aren't going to be fired for meeting their internal KPIs which aren't nebulously AGI.
>so the manager and proper task scheduling is gonna have much more concrete and discrete tasks in the sprint in Jira
Yes, that's the true meaning behind my words. The marketing is saying "we're working on AGI as we speak!" and that's only bare-bones true in the same way that someone is buying a house... while barely starting their savings, and unsure of where they are buying and what they want in it. It's barely an idea, let alone "around the corner".
Meanwhile, they are given small, unexciting, but important stepping stones to experiment with. Nothing that makes the line go up, because that means being truthful. That's why I hate hype culture. It obscures true progress and honesty about progress. A distraction, because talking about a "thing" in a blue sky is more profitable than talking about the actual stepping stones.
LLMs' coding performance is directly proportional to the amount of stolen data in the learning process. That's why frontend folks are swearing by it and forecasting our new god's dominance in just a few years: frontend code is mostly out there in the open, just take it and compile it into a dataset. Stuff like SQL DBs is not lying on every internet corner and is probably very underrepresented in the dataset, producing inferior performance. Same with rare or systems languages, like Rust for example; LLMs are also very bad with those.
It has made me stop using Google and StackOverflow. I can look most things up quickly, without rubber-ducking with other people, and thus I am more efficient. It is also good at spotting bugs in a function if the APIs are known and the API version is something it was trained on. If I need to understand what something is doing, it can help annotate the lines.
I use it to improve my code, but I still cannot get it to do anything that is moderately complex. The paper tracks with what I've experienced.
I do think it will continue to rapidly evolve, but it probably is more of a cognitive aid than a replacement. I try to only use it when I am tight on time, or need a crutch to help me keep going.
This happens in about one third of my coding interactions with LLMs. I've been trying to get better at handling the situation. At some point it's clear you've explained the problem well enough and the LLM actually is regurgitating the same wrong answer, unable to make progress. It would be useful to spot this asap.
I enjoy working with very strongly typed languages (Elm, Haskell), and it's hard for me to avoid "just paste the compile error to the LLM it only takes a second" trap. At some point (usually around three back-and-forths), if the LLM can't fix the error, it will just generate increasingly different compile errors. It's a matter of choosing which one I decide to actually dive into (this is more of a problem with Haskell than Elm, as Elm compile errors are second to none).
Honest question -- not trying to be offensive, but what are you using elm for? Everywhere I've encountered it it's some legacy system that no one has cared to migrate yet and it's a complete dumpster fire.
You spend about three days trying to get it to build then say fuck it and rewrite it.
At least, that's the story of the last (and only) three times I've seen elm code in the wild.
I'm not really a frontend developer. I'm using Elm for toy projects, in fact I did one recently.[0] Elm is my favourite language!
> You spend about three days trying to get it to build then say fuck it and rewrite it.
What are the problems you encounter? I can't quite imagine in what way an Elm project could be hard to build! (Also not trying to be offensive, but I almost don't believe you!)
And into which language do you rewrite those "dumpster fire" Elm codebases?
typescript usually, the elm frontends tend to be in some abandoned repo which hasn't had a ci run in like 2 years and which instantly fail on missing deps or security controls etc.
Yes, right, I forgot, this also happened to me once: someone deleted their repo for an Elm library. It was salvageable through whatever archive and publishing my own copy.
It happens less often in Elm than in JavaScript though! I'll take "abandoned for two years" Elm project over "abandoned for two years" typescript project anytime!
While I do find LLMs useful, it's mostly for simple and repetitive tasks.
In my opinion, they aren’t actually coding anything and have no amount of understanding. They are simply advanced at searching things and pasting back an answer that they scraped online. They can also run simple transformations on those snippets like rename variables. But if you tell it there’s a problem, it doesn’t try to actually solve the problem. It just traverses to the same branch in the tree and tries to give you another similar solution in the tree or if there’s nothing better it will give you the same solution but maybe run a transformation on it.
So, in short, learn how to code or teach your kids how to code. Because going forward, I think it’s going to be more valuable than ever.
I had to do something similar with BigQuery and some open source datasets recently.
I had bad results with Claude, as you mentioned. It kept hallucinating parts of the docs for the open datasets, coming up with nonsense columns, and not fixing errors when presented with the error text and more context. I had a similar outcome with 4o.
But I tried the same with o1 and it was much better consistently, with full generations of queries and alterations. I fed it in some parts of docs anytime it struggled and it figured it out.
Ultimately I was able to achieve what I was trying to do with o1. I’m guessing the reasoning helped, especially when I confronted it about hallucinations and provided bits of the docs.
Maybe the model and the lack of CoT could be part of the challenge you ran into?
Yes, I imagine some do like to read and then ponder over the BigQuery docs. I like to get my work done. In my case, o1 nailed BigQuery flawlessly, saving me time. I just needed to feed in some parts of the open source dataset docs.
I do something like this every day at work lol. It's a good base to start with, but often you'll eventually have to Google or look at the docs to see what it's messing up
For what it’s worth, I recently wrote an SQL file that gave an error. I tried to fix it myself and searched the internet but couldn’t solve it. I pasted it into Claude and it solved the error immediately.
I am a paying user of both Claude AI and ChatGPT, I think for the use case you mention ChatGPT would have done better than Claude. At $20/month I recommend that you try it for the same use case. o1 might have succeeded where Claude failed.
Meh. Ballpark they're very similar. I think people overestimate the differences between the LLMs. Similarly to how people overestimate the differences between... people. The difference between the village idiot and Einstein only looks like a big difference to us humans. In the grand scale of things, they're pretty similar.
Now, obviously, LLMs and humans aren't that similar! Different amount of knowledge, different failure modes, etc.
There is no one "SQL", unfortunately. All of the major database engines have their own forks and extensions. If you didn't specify which one you were using (Microsoft SQL, Oracle, Postgres, SQLite, MySQL), then you didn't give the LLM enough information to work with, same as a junior engineer working blindly with only the information you give them.
I left that part out for brevity, but I told Claude the version of Postgres I was using at the start, and even specified that the mistake it produced is invalid in Postgres.
When claude gets in a loop like this the best thing to do is just start over in a new window.
When you line up Claude with some good context and a good question it does really well. There are more specialized LLMs for SQL; I would try one of those. Claude is a generalist and, because of that, it's not great at everything.
It's really good at react and python -- as someone else mentioned -- that junior code is public and available.
However, random sql needs more "guiding" via the prompt. Explain more about the data and why it's wrong. Tell claude, "I think you're producing slop" and he will break out of his loop.
I've had a pretty similar outlook and still kind of do, but I think I do understand the hype a little bit: I've found that Claude and Gemini 2 Pro (experimental) sometimes are able to do things that I genuinely don't expect them to be able to do. Of course, that was the case before to a lesser extent already, and I already know that that alone doesn't translate to useful necessarily.
So, I have been trying Gemini 2 Pro, mainly because I have free access to it for now, and I think it strikes a bit above being interesting and into the territory of being useful. It has the same failure mode issues that LLMs have always had, but honestly it has managed to generate code and answer questions that Google definitely was not helping with. When not dealing with hallucinations/knowledge gaps, the resulting code was shockingly decent, and it could generate hundreds of lines of code without an obvious error or bug at times, depending on what you asked. The main issues were occasionally missing an important detail or overly complicating some aspect. I found the quality of unit tests generated to be sub par, as it often made unit tests that strongly overlapped with each other and didn't necessarily add value (and rarely worked out-of-the-box anyways, come to think of it.)
When trying to use it for real-world tasks where I actually don't know the answers, I've had mixed results. On a couple occasions it helped me get to the right place when Google searches were going absolutely nowhere, so the value proposition is clearly somewhere. It was good at generating decent mundane code, bash scripts, CMake code, Bazel, etc. which to me looked decently written, though I am not confident enough to actually use its output yet. Once it suggested a non-existent linker flag to solve an issue, but surprisingly it actually did inadvertently suggest a solution to my problem that actually did work at the same time (it's a weird rabbit hole, but compiling with -D_GNU_SOURCE fixed an obscure linker error with a very old and non-standard build environment, helping me get my DeaDBeeF plugin building with their upstream apbuild-based system.)
But unfortunately, hallucination remains an issue, and the current workflow (even with Cursor) leaves a lot to be desired. I'd like to see systems that can dynamically grab context and use web searches, try compiling or running tests, and maybe even have other LLMs "review" the work and try to get to a better state. I'm sure all of that exists, but I'm not really a huge LLM person so I haven't kept up with it. Personally, with the state frontier models are in, though, I'd like to try this sort of system if it does exist. I'd just like to see what the state of the art is capable of.
Even that aside, though, I can see this being useful especially since Google Search is increasingly unusable.
I do worry, though. If these technologies get better, it's probably going to make a lot of engineers struggle to develop deep problem-solving skills, since you will need them a lot less to get started. Learning to RTFM, dig into code and generally do research is valuable stuff. Having a bot you can use as an infinite lazyweb may not be the greatest thing.
Makes perfect sense why it couldn't answer your question: you didn't have the vocabulary of relational algebra to correctly prime the model. Any rudimentary field has its own corpus of vocabulary to express ideas and concepts specific to that domain.
Half of the work is specification and iteration. I think there’s a focus on full SWE replacement because it’s sensational, but we’ll more end up with SWE able to focus on the less patterned or ambiguous work and made way more productive with the LLM handling subtasks more efficiently. I don’t see how full SWE replacement can happen unless non-SWE people using LLMs become technical enough to get what they need out of them, in which case they probably have just become SWE anyway.
> unless non-SWE people using LLMs become technical enough to get what they need out of them
Non-SWE person here. In the past year I've been able to use LLMs to do several tasks for which I previously would have paid a freelancer on Fiverr.
The most complex one, done last spring, involved writing a Python program that I ran on Google Colab to grab the OCR transcriptions of dozens of 19th-century books off the Internet Archive, send the transcriptions to Gemini 1.5, and collect Gemini's five-paragraph summary of each book.
If I had posted the job to Fiverr, I would have been willing to pay several hundred dollars for it. Instead, I was able to do it all myself with no knowledge of Python or previous experience with Google Colab. All it cost was my subscription to ChatGPT Plus (which I would have had anyway) and a few dollars of API usage.
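Roughly, the kind of script involved looked something like the sketch below (not the exact code; the item identifiers, the "_djvu.txt" naming convention, and the Gemini model name here are illustrative assumptions):

import requests
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")
model = genai.GenerativeModel("gemini-1.5-flash")

# Hypothetical Internet Archive item identifiers
identifiers = ["hypothetical-book-1884", "hypothetical-book-1891"]

for item in identifiers:
    # The "_djvu.txt" derivative is the plain-text OCR for many scanned books
    ocr_url = f"https://archive.org/download/{item}/{item}_djvu.txt"
    text = requests.get(ocr_url).text
    prompt = (
        "Summarize this 19th-century book in exactly five paragraphs:\n\n"
        + text[:500_000]  # crude guard against overrunning the context window
    )
    print(f"## {item}\n{model.generate_content(prompt).text}\n")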
I didn't put any full-time SWEs out of work, but I did take one job away from a Fiverr freelancer.
> I didn't put any full-time SWEs out of work, but I did take one job away from a Fiverr freelancer.
I think this is the nuance most miss when they think about how AI models will displace work.
Most seem to think “if it can’t fully replace a SWE then it’s not going to happen”
When in reality, it starts by lowering the threshold for someone who's technical but not a SWE to jump in and do the work themselves. Or it makes the job of an existing engineer more efficient. Each hour of work saved, spread across many tasks that would have otherwise gone to an engineer, eventually sums up to a full-time engineer's worth of work. If it's a Fiverr dev whose work you eliminated, that means the Fiverr dev will eventually go after the work that's remaining, putting supply pressure on other devs.
It’s the same mistake many had about self driving cars not happening because they couldn’t handle every road. No, they just need to start with 1 road, master that, and then keep expanding to more roads. Until they can do all of SF, and then more and more cities
Entirely possible. Have you got any numbers and real world examples? Growth? Profits? Actual quantified productivity gains?
The nuance your 'gotcha' scenario misses is that displacing Fiverr, speeding up small side projects, making scripts for non-SWEs, creating boilerplate, etc. is not the trillions-of-dollars disruption that is needed by now.
This is a good anecdote but most software engineering is not scripting. It’s getting waist (or neck) deep in a large codebase and many intricacies.
That being said I’m very bullish on AI being able to handle more and more of this very soon. Cursor definitely does a great job giving us a taste of cross codebase understanding.
Seconded. Zed makes it trivial to provide entire codebases as context to Claude 3.5 Sonnet. That particular model has felt as good as a junior developer when given small, focused tasks. A year ago, I wouldn’t have imagined that my current use of LLMs was even possible.
Not sure about Claude, but my main problem with o3-mini is that it 'forgets' the things which are supposed to fit in the context window. This results in it using different function names and data structures. I think it's guessing them instead of fetching them from the previous records.
> This is a good anecdote but most software engineering is not scripting. It’s getting waist (or neck) deep in a large codebase and many intricacies.
The agent I'm working on (RA.Aid) handles this by crawling and researching the codebase before doing any work. I ended up making the first version precisely because I was working on a larger monorepo project with lots of files, backend, api layer, app, etc.
So I think the LLMs can do it, but only if techniques are used to allow it to hone in on the specific information in a codebase that is relevant to a particular change.
If the goal is to get something to run correctly roughly once with some known data or input, then that's fine. Actual software development aims to run under 100% of circumstances, and LLMs are essentially cargo culting the development process and entrusting an automation that is unreliable to do mundane tasks. Sadly the quality of software will keep going down, perhaps even faster.
Stop with the realism, one-off scripts are going to give trillions in ROI any day now. Personally I could easily chip in maybe a million a month in subscription fees because my boilerplate code I write once in a blue moon has sped up infinitely, and I will cash out in profits any day now.
IOW, LLMs make programming somewhat higher-level, similar to what new programming languages did in the past, either via code generation from natural language (the main use case right now?), or by interpreting a "program" written in natural language ("sum all the numbers in the 3rd column of this CSV").
The latter case enables more people to program to a certain extent, similar to what spreadsheets did, while we still need full SWEs in the first case, as you pointed out.
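For instance, that CSV "program" might translate to something like this minimal Python sketch (the file name and the choice to skip non-numeric cells are assumptions):

import csv

total = 0.0
with open("data.csv", newline="") as f:
    for row in csv.reader(f):
        try:
            total += float(row[2])  # 3rd column
        except (IndexError, ValueError):
            continue  # short rows, headers, or non-numeric cells
print(total)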
> The Registered Skilled Reporter (RSR) is NCRA's new designation that will recognize those stenographic professionals who are looking to validate their beginning level of competency.
> You have to pass three five-minute Skills Tests (SKT), which evaluate your skills level in three areas: Literary at 160 wpm, Jury Charge at 180 wpm, Testimony/Q&A at 200 wpm.
This is the National Court Reporter's Association; participating membership is for, as you say, stenographic court reporters and [stenographic] captioners, CART providers, and the like. Their membership FAQ mentions transcriptionists as eligible for associate membership -- not independently, though, only in a role supporting stenographic professionals (so really more like scopists, I believe).
Note also that the speeds listed are described as beginning level. Contrast this with the speed contest(1) featuring literary (i.e., a speech or some kind of governmental literature or something to that effect) read at 200-220 WPM, jury charge (instructions) read at 200-260 WPM, and Q&A ([witness] testimony) read at 280 WPM.
These are the kind of speeds that have been typical of stenographers pretty much as long as it's been a thing, even when it was done with a pen rather than a steno machine -- well, back into the 19th century at least; I personally can't speak to the performance of the earlier shorthand systems off the top of my head.
Those "beginning" speeds are about at the top of what most of the best longhand typists can do at any serious length (see, for example, hi-games.net typing leaderboard for a 5-minute(2) vs 10-second(3) test).
As to the WPM cutoff to be considered a "typist"? I mean, it's not like it's a professional credential or anything. Anyone can be a typist if they're typing, I suppose, or if they choose to take it seriously enough. Even the de facto standards of job requirements are nothing much to go by: The typing speeds listed as required in the job postings for nearly all customer service, tech support, general office, and other such jobs, quite frankly, range from underwhelming to laughable. Even transcriptionists (longhand, as in not steno, and offline, as in not real-time) don't need to type more than about 80 WPM to find work in the field, if even that much. In my view, 80 WPM is still an awfully tedious sort of speed, but I understand it's more commonly considered a respectable one, and more than adequate for most tasks, so I guess I'd be fine with that number if I had to pick one.
I'll also throw in with jb-wells above and say that anyone who's touch-typing (and preferably making progress into the triple digits, or so far as their own ability will allow) might as well be considered a typist -- or anyone who managed to convince someone to pay them to type things at any speed.
I guess 60-70 WPM at >95% accuracy? I have not obtained the needed certification :)
My mom went to a secretary/business assistant school in the '70s and the typing class (on a typewriter!) required using ten fingers and touch typing.
The expectation was you'd be fast enough to transcribe someone dictating (they learned to use stenography for faster situations).
If the LLM can't find me a solution in 3 to 5 tries while I improve the prompt, I fall back to more traditional methods and/or use another model like Gemini.
Yeah, I tried Copilot for the first time the other day and it seemed to be able to handle this approach fairly well -- I had to refine the details, but none of it was because of hallucinations or anything like that. I didn't give it a chance to try to handle the high-level objective, but based on past experience, it would have done something pointlessly overwrought at best.
Also, as an aside, re "not a real programmer" salt: If we suppose, as I've been led to believe, that the "true essence" of programming is the ability to granularize instructions and conceptualize data flow like this, and if LLMs remain unsuitable for coding tasks unless the user can do so, this would seem to undermine the idea that someone can only pretend to be a programmer if they use the LLMs.
Anyway, I used Copilot in VSCode to "Fix" this "code" (it advised me that I should "fix" my "code" by . . . implementing it, and then helpfully provided a complete example):
# Take a URL from stdin (prompt)
# If the URL contains "www.reddit.com", replace this substring with "old.reddit.com"
# Curl the URL and extract all links matching /https:\/\/monkeytype\.com\/profile\/[^>]+/ from the html;
# put them in a defaultdict as the first values;
# for each first value, the key is the username that appears in the nearest previous p.tagline > a.author
# For each first value, use Selenium to browse to the monkeytype.com/profile url;
# wait until 'div[class=\'pbsTime\'] div:nth-child(3) div:nth-child(1) div:nth-child(2)' is visible AND contains numbers;
# assign this value as the second value in the defaultdict
# Print the defaultdict as a json object
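For reference, a minimal sketch (not Copilot's actual output) of one way those comments could be implemented. The old-reddit markup (p.tagline > a.author) and the monkeytype selector come from the comments above; the regex-based author attribution and the Firefox driver are my own approximations, and the whole thing is untested:

import json
import re
from collections import defaultdict

import requests
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait

AUTHOR_RE = re.compile(
    r'<p class="tagline">.*?<a[^>]*class="author[^"]*"[^>]*>([^<]+)</a>', re.S
)
PROFILE_RE = re.compile(r'https://monkeytype\.com/profile/[^">]+')
WPM_SELECTOR = "div[class='pbsTime'] div:nth-child(3) div:nth-child(1) div:nth-child(2)"


def wpm_element(driver):
    # Return the stats element once it is visible and contains digits.
    elem = driver.find_element(By.CSS_SELECTOR, WPM_SELECTOR)
    return elem if elem.is_displayed() and re.search(r"\d", elem.text) else False


def main():
    url = input("URL: ").strip().replace("www.reddit.com", "old.reddit.com")
    html = requests.get(url, headers={"User-Agent": "Mozilla/5.0"}).text

    # Walk authors and profile links in document order, attributing each link
    # to the most recently seen p.tagline > a.author.
    results = defaultdict(list)
    last_author = None
    matches = sorted(
        list(AUTHOR_RE.finditer(html)) + list(PROFILE_RE.finditer(html)),
        key=lambda m: m.start(),
    )
    for m in matches:
        if m.re is AUTHOR_RE:
            last_author = m.group(1)
        elif last_author is not None:
            results[last_author].append(m.group(0))

    # Visit the first profile link per author and wait for the WPM value.
    driver = webdriver.Firefox()  # assumes Firefox/geckodriver is available
    try:
        for author, values in results.items():
            driver.get(values[0])
            elem = WebDriverWait(driver, 30).until(wpm_element)
            values.append(elem.text)
    finally:
        driver.quit()

    print(json.dumps(results, indent=2))


if __name__ == "__main__":
    main()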
This has been obvious for a couple years to anyone in the industry that has been faced with an onslaught of PRs to review from AI enabled coders who sometimes can't even explain the changes being made at all. Great job calling it AI.
This mirrors what I've seen. I've found that LLMs are most helpful in places where I have the most experience.
Maybe this is because of explicitness in the prompt and preempting edge cases. Maybe it's because I know exactly what should be done. In these cases, I will still sometimes be surprised by a more complete answer than I was envisioning, a few edge cases that weren't front of mind.
But if I have _no_ idea things go wildly off course. I was doing some tricky frontend work with dynamically placed reactflow nodes and bezier curve edges. It took me easily 6 hours of bashing my head against the problem, and it was hard to stop using the assistant because of sunk cost. But I probably would have gotten more out of it and been faster if I'd just sat down and really broken down the problem for a few hours and then moved to implement.
The most tempting part of LLMs is letting them figure out design when you're in a time crunch. And the way it solves things when you understand the domain and the bottoms-up view of the work is deceptive in terms of capability.
And in this case, it's hoping that people on upwork understand their problems deeply. If they did, they probably wouldn't be posting on upwork. That's what they're trying to pay for.
I just had this conversation with a customer. And it’s hard to avoid anthropomorphizing ai. Once you equate the ai system with a human - a human who creates perfectly pep8 formatted python is probably a decent python programmer, whereas someone who bangs out some barely readable code with mixed spacing and variable naming styles is most likely a novice.
We use these signals to indicate how much we should trust the code - same with written text. Poorly constructed sentences? Gaps or pauses? Maybe that person isn’t as knowledgeable.
These shortcuts fail miserably on a system that generates perfect grammar, so when you bring your stereotypes gleaned from dealing with humans into the ai world, you’re in for an unpleasant surprise when you unpack the info and find it’s only about 75% correct, despite the impeccable grammar.
> But if I have _no_ idea things go wildly off course.
This is the key to getting some amount of productivity from LLMs in my experience, the ability to spot very quickly when they veer off course into fantasyland and nip it in the bud.
Then you point out the issue to them, they agree that they made a dumb mistake and fix it, then you ask them to build on what you just agreed to and they go and reintroduce the same issue they just agreed with you was an obvious problem... because ultimately they are more fancy auto complete machines than they are actual thinking machines.
I have found them to be a time saver on the whole even when working with new languages but I think this may in large part be helped by the fact that I have literally decades of coding experience that sets off my spidey senses as soon as they start going rampant.
I can't begin to imagine how comical it must be when someone who doesn't have a strong programming foundation just blindly trusts these things to produce useful code until the runtime or compile time bugs become unavoidably obvious.
It's the opposite. An LLM is better at CEO stuff than working code. A good developer + LLM instead of CEO can succeed. A good CEO + LLM instead of developer cannot succeed. (For a tech company)
Is that a fact? I mean, see the linked article; even the company whose whole business model lies in convincing people that that _is_ a fact is kinda saying “yeah, perhaps not”, with vague promises of jam tomorrow.
Even better, if you click through to the linked source he doesn't say "low-level" at all, or make any claim that is at all like the claim he is cited as making!
LLMs are still just text generators. These are statistical models that cannot think or solve logical problems. They might fool people, as Weizenbaum's "Eliza" did in the late 60s, by generating code that sort of runs sometimes, but identifying and solving a logic problem is something I reliably see these things fail at.
Have you tried the latest models, using them with Cursor etc? They might not be truly intelligent but I’d be surprised if an SWE can’t see that they are already offering a lot of value.
They probably can’t solve totally novel problems but they are good at transposing existing solutions to new domains. I’ve built some pretty crazy stuff with just prompts - granted I can prompt with detailed technical instructions when needed as I’m a SWE, similar to instructing a junior. I’ve built prototypes which would take days in hours which to me is hugely exciting.
The code quality of pure AI generated code isn’t great but my approach right now is to use that to prototype things mostly with prompts (it takes as much time to build a prototype as it would to create a mock up or document explaining the idea previously) then once we are committed to it, I’ll rebuild it mostly by hand but using Cursor to help.
I’ve got 15 years of coding experience at some of the biggest tech companies. My personal opinion is that most people have no clue how good these AI coding systems already are. If you use something like RepoPrompt, where you selectively choose which files to include in the prompt, and then also provide a clear description of what changes you want to make—along with a significant portion of the source code—a model like O1Pro will nail the solution the first time.
The real issue is that people are not providing proper context to the models. Take any random coding library you’re interfacing with, like a Postgres database connection client. The LLM isn’t going to inherently know all of the different configurations and nuances of that client. However, if you pass in the source code for the client along with the relevant portions of your own codebase, you’re equipping the model with the exact information it needs.
Every time you do this, including a large prompt size—maybe 50,000 to 100,000 tokens—you dramatically improve the model’s ability to generate an accurate and useful response. With a strong model like O1Pro, the results can be exceptional. The key isn’t that these models are incapable; it’s that users aren’t feeding them the right data.
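As a bare-bones illustration of that context-packing approach (the file list, the rough character budget, and the task text here are made up, not from any real project):

from pathlib import Path

# Hypothetical files: the library source you're interfacing with plus the
# relevant parts of your own codebase.
FILES = [
    "vendor/pg_client/connection.py",
    "app/db.py",
    "app/queries/reports.py",
]
MAX_CHARS = 400_000  # very roughly on the order of 100k tokens

parts = []
for path in FILES:
    parts.append(f"===== {path} =====\n{Path(path).read_text()}\n")

prompt = (
    "You are editing the codebase below.\n\n"
    + "".join(parts)[:MAX_CHARS]
    + "\n\nTask: add statement_timeout and pool-size configuration to the "
    "Postgres client and update the call sites accordingly."
)
print(f"{len(prompt):,} characters of context prepared")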
Are you suggesting that OpenAI published a paper assessing their own models on real-world problems, but failed to properly use their own models? And/or that you know better than OpenAI scientists how to use OpenAI models most effectively?
But telling us that the designers of a product are stupid and don't know how to use their own product when they're disclosing its limitations should really come with more than a "trust me bro" as evidence.
I find the framing of this story quite frustrating.
The purpose of new benchmarks is to gather tasks that today's LLMs can't solve comprehensively.
If an AI lab built a benchmark that their models scored 100% on, they would have been wasting everyone's time!
Writing a story that effectively says "ha ha ha, look at OpenAI's models failing to beat the new benchmark they created!" is a complete misunderstanding of the research.
Shhh ... you're spoiling everybody's confirmation bias against LLMs. They are obviously terrible at coding, just as we have known all along, and everybody should laugh at them. Nothing to see here!
Since you are one of the cool kids in the know, can you share the roadmap to profitability and, even better, the expected/hyped ROI? Without extrapolations into science fiction, please.
I wonder how many of the solutions that pass the SWE-Lancer evals would not be accepted by the poster due to low quality.
I’ve been trying so many things to automate solving bugs and adding features 100% by AI and I have to admit it’s been a failure. Without someone that can read the code and fully understand the AI generated code and suggests improvements (SWE in the loop) AI code is mostly not good.
So this is an in-house benchmark after their undisclosed partnership with a previous benchmark company. Really hope their next model doesn't vastly outperform on this benchmark in the coming weeks.
To all those devs saying "I tried it and it wasn't perfect the first time, so I gave up", I am reminded of something my father used to say:
"A bad carpenter blames his tools"
So AI is not going to answer your question right on its first attempt in many cases. It is forced to make a lot of assumptions based on the limited info you gave it, and some of those may not match your individual case. Learn to prompt better and it will work better for you. It is a skill, just like everything else in life.
Imagine going into a job today and saying "I tried Google but it didn't give me what I was looking for as the first result, so I don't use Google anymore". I just wouldn't hire a dev that couldn't learn to use AI as a tool to get their job done 10x faster. If that is your attitude, 2026 might really be a wake-up call for your new life.
One of the reasons SO works is that the correct or best answer tends to move toward the top of the list. AI struggles to do this reliably, so it's closer to an SO where the answers are randomly ordered and you have to try a few to get the one that's correct.
Also, I bet your father imagined the carpenter's toolbox full of well accepted useful tools. For many of us, non-bad carpenters, AI hasn't made the cut yet.
All I am saying is that if you are expecting it to fail for you, that is absolutely what it will do. You weren't good at typing on your first day either. I went from a decent engineer with 20 years of experience to a 10x engineer able to take on any problem, in less than a year. All because I learned how to use the tool effectively.
Just don't give up; get back on that bike and keep pedaling. I promise it will amaze you if you give it a chance.
This is a tool, just like all the other ones you have learned, but it will make you a far better engineer. It can fill all those gaps in your understanding of code. You can ask it all those questions you are unwilling to ask your colleagues, because you think you will sound dumb for not knowing. You can ask it to explain everything again if you still don't get it. It is powerful if you know how to use it.
> OpenAI researchers have admitted that even the most advanced AI models still are no match for human coders — even though CEO Sam Altman insists they will be able to beat "low-level" software engineers by the end of this year
This is the “self-driving cars next year, definitely” of the 20s, at this point.
The benchmark for AI models to assess their 'coding' ability should be on actual real-world, production-grade repositories and fixing bugs in them, such as the Linux kernel, Firefox, SQLite, or other large-scale, well-known repositories.
Not these HackerRank, LeetCode, or previous IOI and IMO problems, which we already have the solutions to, where it's just reproducing the most optimal solution copied from someone else.
If it can't manage most unseen coding problems with no previous solutions to them, what hope does it have against explaining and fixing bugs correctly on very complex repositories with over 1M-10M+ lines of code?
For anyone who doesn't know what IOI and IMO refer to:
IOI refers to the International Olympiad in Informatics, a prestigious annual computer science competition for high school students, while IMO refers to the International Mathematical Olympiad, which is a world-renowned mathematics competition for pre-college students.
> The models weren't allowed to access the internet
How many software developers could solve even simple programming problems (beyond 'Hello world') in a zero-shot style (you write in Notepad, then can compile only once and execute once) without access to the internet (Stack Overflow, Google search, documentation) or tools (terminal, debugger, linter, CLI)?
I think it's not the best comparison from which to make any judgement. Future benchmarks should test agents that are allowed to solve the problem in 5-10 minutes, with access to the internet, documentation, a linter, and a terminal with MCP servers.
> How many software developers could solve even simple programming problems (beyond 'Hello world') in a zero-shot style (you write in Notepad, then can compile only once and execute once) without access to the internet (Stack Overflow, Google search, documentation) or tools (terminal, debugger, linter, CLI)?
Many. There was a time when SO did not exist and people were able to solve non-trivial problems. There was a time when coding problems on exams had to be solved on paper, and if they did not compile you would not pass.
You miss my point about zero-shot style, where you have only one shot to compile and execute your code. Even in the old times when people programmed using punched cards, it required a lot of reviews and iterations. This is the reason why scripting languages like Python, Ruby, PHP, and JavaScript got popular: you had a very fast feedback loop and could do dozens of mini experiments. The majority of coding problems we have today are not algorithmic in nature.
What would searching the Internet provide the models that they don’t already have? Most likely data sources such as stack overflow, documentation on the language it’s targeting, and a variety of relevant forum posts are already part of its training set.
Unless someone else came along and said “here’s how to solve x problem step by step”, I don’t see how additional information past its cutoff point would help. (Perhaps the AI could post on a forum and wait for an answer?)
Yes, iterative programming could help via access to tools- I can see that helping.
Why do programmers search for specific questions rather than always relying on their inherent knowledge?
I’m a crappy hobbyist programmer but for me it is useful to see if someone has implemented exactly what I need, or debugged the problem I’m having. I don’t think it’s reasonable to expect programmers or LLMs to know everything about every library’s use in every context just from first principles.
I do it to save the limited brain power I have before rest or food is required. You could spend 5 minutes writing a sort (at a high level processing) or just use existing code which might take 5 minutes to find but uses less brain power.
This allows you to use that brain power on specific things that need you and let google remember the format of that specific command or let an ai write out your routing file.
The older I get the less I'm bound by time, lack of knowledge or scope but more limited by clarity. Delegate tasks where possible and keep the clarity for the overall project and your position.
But why would that information not be included in the wide crawl already encoded in the model weights before the knowledge cutoff? I believe the article mentions frontier models so we are talking about models trained on trillions of tokens here
In addition to cutoff dates, models do not encode every single thing from the training set verbatim. One forum post somewhere about Foo library v13.5.3 being incompatible with Bar 2.3 and resulting in ValueErrors is not going to make it.
Because the cutoff can be, like, a few months ago and you still have new versions of libraries being developed every month, APIs getting deprecated or removed, and new APIs being added. The model needs to have access to the latest API or SDK that is available and know, e.g., what iOS SDK you currently have available and what macOS version you have, etc. Having access to GitHub issues also helps to figure out whether there is a bug in a library.
You sound like someone who never used punch cards.
I think most developers could do that if they trained for it. As someone who learned how to program before the internet, it's just a different mindset and would take some time to adjust.
I am doing that now, where changes take a day to make it to staging and there is no local environment. You roll with it.
It depends a lot on the type of problem. If we're talking about fixing a bug or adding a new feature to a large existing code base, which probably describes a huge portion of professional software engineering work, I would say most engineers could do most of those tasks without the internet. Especially if the goal is to simply pass a benchmark test of getting it working without future considerations.
I barely ever look at StackOverflow as the quality of answers there is so poor. It was once good but the proliferation of duplicates[1] has really ruined it for me, as well as outdated answers not being replaced. Google search results are also crap.
I agree with your point, though. The "LLM" model just isn't a good fit for some tasks, in fact many tasks. It is good for creative writing, but even then only really because our standards for creative writing are pretty low. It doesn't write with any real creativity or flair in the writing. It can make things up and stay on topic. It is poor for anything where accuracy matters. It can't edit what it produces! Nobody writes things in one shot in reality, not even creative writing, but especially not code or technical writing. It needs to be able to do a whole suite of other things: move blocks of output around, rewrite chunks, expand chunks, condense chunks, check chunks against external sources or proper knowledge banks, compare chunks for internal consistency, and more. That is how we operate: at the level of functions or blocks of code, at the level of paragraphs and sentences and sections.
[1]: Yes, the opposite of the problem people here usually have with it, which is things being closed as duplicates. I think more duplicates should be deleted and redirected to a canonical answer, which is then a focus of improvement. Too often google searches give me barely answered or unanswered duplicates and I have to click around in the site to find the result Google clearly should have given me in the first place (better keyword matches, not closed, higher score, etc). I think StackOverflow do this intentionally so people have to click on more pages and see more ads.
I think about this a lot. AI in the current state is like working with an intern who is on a stranded island with no internet access or compiler, they have to write down all of the code in forward sequence on piece of paper, god help them if they have to write any UI while also being blind. None of the "build an app with AI start-to-finish" products work well at all because of this.
AI models are trained on the data from the internet, so sure, they couldn't do their search feature to scour the internet, but I doubt the material is much different than what the models were already trained on.
Additionally, before the age of stackoverflow and google, SWEs cracked open the book or documentation for whatever technology they were using.
Interviews like leetcode on a whiteboard only test your reasoning, not whether your solution will execute out of the box in a zero-shot style. Humans solve problems in an iterative way; that's why a fast feedback loop and access to tools are essential. When you start coding, the compiler or linter hints that you forgot to close some braces or missed a semicolon. The compiler tips you off that an API changed in a new version, and IntelliSense hints at what methods you can use in the current context and what parameters you can pass and their types. Once you execute the program you get runtime tips that maybe you missed installing some Node or Python package. When you install packages you get hints that maybe one package has an additional dependency or two package versions are not compatible. Command-line tools like `ls` tell you what the project structure is, etc.
> How many software developers could solve even simple programming problems (beyond 'Hello world') in a zero-shot style (you write in Notepad, then can compile only once and execute once) without access to the internet (Stack Overflow, Google search, documentation) or tools (terminal, debugger, linter, CLI)?
Isn't the point of the training that they already have all the information they could have? So they do not need the internet, as on the internet there would only be information they already "know"...
They tested with programming tasks and manager's tasks.
The vast majority of tasks given require bugfixes.
Claude 3.5 Sonnet (the best performing LLM) passed 21.1% of programmer tasks and 47.0% of manager tasks.
The LLMs have a higher probability of passing the tests when they are given more attempts, but there's not a lot of data showing where the improvement tails off. (probably due to how expensive it is to run the tests)
Personally, I have other concerns:
- A human being asked to review repeated LLM attempts to resolve a problem is going to lead that human to review things less thoroughly after a few attempts and over time is going to let false positives slip through
- An LLM being asked to review repeated LLM attempts to resolve a problem is going to lead to the LLM convincing itself that it is correct with no regard for the reality of the situation.
- LLM use increases code churn in a code base
- Increased code churn is known to be bad for the health of projects
Increasingly I think that the impact of generative AI is going to more of an incremental form of disruption than revolutionary. More like spreadsheets than the printing press.
Spreadsheets becoming mainstream made it easy to do computing that once took a lot of manual human labor quite quickly. And it made plenty of jobs and people who do them obsolete. But they didn’t upend society fundamentally or the need for intelligence and they didn’t get rolled out overnight.
Coding, especially the type mentioned in the article (building an app based on a specification), is a highly complex task. It cannot be completed with a single prompt and an immediate, flawless result.
This is why even most software projects (built by humans) go through multiple iterations before they work perfectly.
We should consider a few things before asking, "Can AI code like humans?":
- How did AI learn to code? What structured curriculum was used?
- Did AI receive mentoring from an experienced senior who has solved real-life issues that the AI hasn't encountered yet?
- Did the AI learn through hands-on coding or just by reading Stack Overflow?
If we want to model AI as being on par with (or even superior to) human intelligence, don’t we at least need to consider how humans learn these complex skills?
Right now, it's akin to giving a human thousands of coding books to "read" and "understand," but offering no opportunity to test their programs on a computer. That’s essentially what's happening!
Without doing that, I don't think we'll ever be able to determine whether the limitation of current AI is due to its "low intelligence" or because it hasn’t been given a proper opportunity to learn.
LLMs can fundamentally only do something similar to learning in the training phase. So by the time you interact with it, it has learned all it can. The question we then care about is whether it has learned enough to be useful for problem X. There's no meaningful concept of "how intelligent" the system is beyond what it has learned, no abstract IQ test decoupled from base knowledge you could even conceive of.
It didn't; it's just very good at copying already existing code and tweaking it a bit.
>Did AI receive mentoring from an experienced senior
It doesn't even comprehend what an experienced senior is; all it cares about is how frequently certain patterns occurred in certain circumstances.
>Did the AI learn through hands-on coding or just by reading Stack Overflow?
it "learnt" by collecting a large database of existing code, most of which is very low quality open source proofs of concept, then spits out the bits that are probably related to a question.
I think we're drastically oversimplifying what "pattern matching" means. It is also one of the fundamental mechanisms by which the human brain operates. I believe we are consciously (or perhaps subconsciously) conditioned to think that human "logic" and "reasoning" are several degrees more advanced than pattern matching. However, I don't think this is true.
The fundamental difference lies in how patterns are formed in each case. For LLMs, all they know are the patterns they observe in "words" - that is the only "sense" they possess. But for humans, pattern recognition involves continuously ingesting and identifying patterns across our five primary senses—not just separately, but simultaneously.
For example, when an LLM describes something as "ball-shaped," it cannot feel the shape of a ball because it lacks another sense to associate with the word "ball-shaped." In contrast, humans have the sense of touch, allowing them to associate the word or sound pattern "ball" with the physical sensation of holding a ball.
>It is also one of the fundamental mechanisms by which the human brain operates.
One of the fundamental mechanisms by which brains operate - the bits we share with every other animal with a brain.
Good luck teaching your dog to code.
Being great at fetching your newspaper in the morning doesn't mean it's going to wake up and write you an accounting software package at the end of the year.
We don't even need that example. The example is in front of us. Take a smaller parameter model and ask it to do the same complex thing that a larger parameter model did. It will struggle.
Btw, I'm not saying it's just the number of parameters that matters.
"Here's the choice you have when you are faced with something new. You can take this technological advance and decide "This is a better way of doing the stuff I'm doing now and I can use this to continue on the path that I'm going", so that's staying in the pink plane , or you can say "This is not a better old thing, this is almost a new thing and I wonder what that new thing is trying to be" and if you do that there's a chance of actually perhaps gaining some incredible leverage over simply optimizing something that can't be optimized very much. - Kay
Current LLMs will change the world, but it won't be by completing pull requests quickly.
Although a "stargate level" LLM could accelerate pink plane traversal so much that you don't even need to find the correct usecase. LLM scaling will be the computer graphics scaling of this generation. In terms of intelligence gpt4 based o3 is but a postage stamp. As LLMs scale a picture of intelligence will emerge.
LLMs will never solve this problem; they are basically just glorified copy & paste engines, and solving real code problems requires invention, even for the most basic tasks. The best they will manage in their current direction is to reason that they don't have the capability or capacity to actually solve the problem, rather than just getting it wrong the vast majority of the time.
I believe it. I couldn't even get o1 or claude 3.5 to write a tampermonkey script that would turn off auto-scroll to bottom in LibreChat, even when uploading the html and javascript as context.
Apparently it has to do with overflow anchor or something in React? Idk. I gave up.
I prompted up a very basic Flask scaffold via Windsurf and once it reached a certain code size, it just started to remove or weirdly rewrite old parts to handle the context. ("You're right let's move that back in"). Didn't end well.
It's so much easier to learn from examples than from documentation, in my opinion; documentation is what I use when I want to know additional parameters or downsides of a functionality. I'm no coder though.
What I'd like LLMs to do is present examples using acceptable design standards, e.g. what's the pythonic way to do this, what are the exceptions that might yield better performance/optimization (and at what cost), or what is the best Go(lang) JSON parser (since the built-in isn't very good).
But instead, I get average to below-average examples (surprise surprise, this is what happens when you train on a high noise-to-signal set of data), which are either subtly or wildly incorrect. I can't see this improving, with reddit and other forums trying to introduce AI bot written posts. Surely these companies are aware of how LLM output degenerates when fed its own input within a few (not even dozen) generations?!?
AI don't "solve" problems, best it can do is remember them. Ask them to solve anything new that's challenging and it starts to hallucinate.
At least currently.
And I'm ashamed that OpenAI and Sam Altman are walking around talking about AGI. And I'm so... disillusioned by the entire tech community that they have fallen for it, or at least pretend to believe it. It's like LinkedIn, where everybody pretends to be cringe positivity people, even though they know it's cringe and nobody believes it.
Whenever AI hallucinates a little complex SQL or some tool / language it doesn't have much training data on, I think of AGI hype and Sam Altman's words on how AI can be used to cure cancer in near future.
Instead, if they rightly just said it is a useful tool to be used by researchers to help them, like a smart calculator for big data, it would be so much more honest and correct.
Not sure what they found: either the models are unable, or they were unable to solve the tasks using the models. Looks like they used straight questions and not chain-of-thought. The result for the same model depends on how you ask. The tasks probably required more thinking under the hood than the model is allowed to do in one request. More interesting would be whether a model is capable of solving them given enough time, using multiple requests orchestrated by some framework automatically.
The biggest scam in AI is Salesforce. They’re going to take some crappy model that they made or simply switch to a better open source model. Then they’re going to make a large deal with a cloud provider to spin up all these models for all their customers and then re-sell it as AI to their customers for 100x what OpenAI gets per month. And the quality will be lower.
To me o1 is pretty good. I dunno how it would digest an entire codebase and solve a bug in it. Those details weren't obvious to me from the article above. But o1 has certainly been very valuable to me in coding in new languages on the fly.
LLMs at this point are just a great replacement for Stack Overflow. If what you're doing has been heavily documented and you just need a primer or some skeletal sample code, the LLMs are great.
They are not creative at all, but 99% of my job is not creative either.
Despite the lackluster coding performance, AI has PROVEN it's able to provide a rationale for profit-taking job cuts, layoffs, reduced stock grants, and increased executive bonuses.
I believe the outcome of this type of article is actually positive. The ‘SWE-Lancer’ benchmark provides visibility into a more pragmatic assessment of LLM capabilities.
Ironically, it actually refutes Altman's claims mentioned in the same article. Hard to replace engineers when you create a benchmark you can't score decently on.
Or it could be a case of: Never prepare a benchmark/prep a comparison which you think you won't succeed at. This is especially true when you are funded by mostly private/VC investors. Time will tell.
I think they are trying to frame the narrative, then succeed at it. Let's see. This helps justify OpenAI's valuation and efforts to investors/VCs. After all, IMO, without coding as a use case for LLMs, AI wouldn't have nearly the same hype/buzz as it does now. Greed (profit) and fear (losing jobs) are great motivators to keep investment hype and funds coming in.
What kinds of problems are you talking about? There are problems that require you to learn new libraries and services constantly, accessing documentation, and there are actual software problems when you have to reflect on your own large code base. I work on both kinds of problems, and in the first case, the models are actually well versed in say, all of CloudFormation syntax, that I would have to look up. On the opposite end, I have written many features on trips, unable to be distracted by the internet, just me and the code, and being able to read library source code.
The fact is, programming requires abstract modeling that language models aren’t demonstrating the capability of fully replicating. At least, not that we can see, yet.
A decent human programmer with experience in a particular domain may rely on internet access to look up API documentation and other generic references, but if you read the paper, you'll see that the AI systems tested suffered from more basic deficiencies in approach and reasoning ('3.6. Discussion', starting on page 7).
We did have usenet in the 80s and gopher in the 90s. But yes, in those days it was that mythical "paper" stuff (or were we still using papyrus? I forget)
Seriously? I think I've been most productive pre-internet days when stuck on a trans-atlantic flight with a java (shudder, talk about PTSD) reference manual and laptop that could barely last 2 hours on battery, with emphasis on the measure-twice, cut once mentality.
It's painful to watch junior coders copy-n-paste from SO or W3schools (including code samples clearly labelled not-for-production) with little effort to understanding what they are doing.