Hacker News

I've found that as LLMs improve, some of their bugs become increasingly slippery - I think of it as the uncanny valley of code.

Put another way, when I cause bugs, they are often glaring (more typos, fewer logic mistakes). Plus, as the author it's often straightforward to debug since you already have a deep sense for how the code works - you lived through it.

So far, using LLMs has reduced my productivity. The bugs LLMs introduce are often subtle logical errors hiding in otherwise "working" code. These errors are especially hard to debug when you didn't write the code yourself: you end up having to learn the code as thoroughly as if you had written it anyway.
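To make that concrete, here's a hypothetical sketch (mine, not from any actual LLM session; the function and numbers are invented) of the kind of subtle-but-"working" bug I mean:

```python
def moving_average(values, window):
    """Trailing moving average over a list of numbers."""
    # A plausible-but-wrong version writes range(len(values) - window),
    # which runs cleanly and looks reasonable but silently drops the
    # final window -- exactly the off-by-one that survives a casual review.
    return [
        sum(values[i:i + window]) / window
        for i in range(len(values) - window + 1)
    ]

print(moving_average([1, 2, 3, 4], 2))  # [1.5, 2.5, 3.5]
```

The buggy variant still returns a correct-looking list for most inputs, so nothing crashes and no test that only checks "is it a list of averages" catches it.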

I also find it more stressful to deploy LLM code. I know in my bones how carefully I write code; a decade of roughly one non-critical bug per 10k lines is what lets me sleep at night. The quality of LLM code can be quite chaotic.

That said, I'm not holding my breath. I expect this to all flip someday, with an LLM becoming a better and more stable coder than I am, so I guess I will keep working with them to make sure I'm proficient when that day comes.

I have been using LLMs for coding a lot over the past year, and I've been writing down my observations by task. For many tasks, my first entry records how thoroughly impressed I was by how, e.g., Claude helped me, and the second entry, a few days later, records my irritation at chasing down the subtle and just _strange_ bugs it introduced along the way. As a rule, these are incredibly hard to find and tedious to debug, because they lurk in the weirdest places, and the root cause is usually some odd confabulation that a human brain would never concoct.

Saw a recent talk where someone described AI as making errors, but not errors a human would naturally make: they're usually "plausible but wrong" answers. In other words, the errors these AIs make are of a different nature than a human's. This is the danger; it makes reviews harder, and I can't trust them as much as a person coding, at present. The agent tools (Claude Code, Aider, etc.) are a little better in that they can at least consume build and test output, but even then I've noticed them doing things that are wrong yet "plausible and build fine".

I've noticed it in my day-to-day: reviewing an AI PR is different from reviewing a co-worker's PR, with different kinds of problems. Unfortunately the AI issues tend to be the subtle kind, the things that could sneak into production code if I'm not diligent. It means reviews matter more, and I can't rely on prior experience with a co-worker and the typical quality of their PRs: effectively, every new PR comes from a different worker.


I'm curious where that expectation of a flip comes from. Your experience (and mine, frankly) would seem to indicate the opposite, so whence this certainty that one day it'll change entirely and become reliable instead?

I ask (and I'll keep asking) because it really seems like the prevailing narrative is that these tools have improved substantially in a short period of time, and that is seemingly enough justification to claim that they will continue to improve until perfection because...? waves hands vaguely

Nobody ever seems to have a good justification for how we're going to overcome the fundamental issues with this tech, just a belief from SOMEWHERE that it'll happen anyway. I'm very curious to drill down into that belief and see whether it comes from somewhere concrete, or whether it's just something that gets said often enough that it "becomes true", regardless of reality.

