When I ran this experiment it was pretty exhilarating for a while. Eventually it turned into QA testing the work of a bad engineer and became exhausting. Since I had sunken so much time into it I felt pretty bad afterwards that not only did the thing it made not end up being shippable, but I hadn't benefitted as a human being while working on it. I had no new skills to show. It was just a big waste of time.
So I think the "second way" is good for demos now. It's good for getting an idea of what something can look like. However, in the future I'll be extremely careful about not letting that go on for more than a day or two.
I believe the author explicitly suggests strategies to deal with this problem, which is the entire second half of the post. There’s a big difference between when you act as a human tester in the middle vs when you build out enough guardrails that it can do meaningful autonomous work with verification.
I'm just extremely skeptical about that because I had many ideas like that and it still ended up being miserable. Maybe with Opus 4.5 things would go better though. I did choose an extremely ambitious project to be fair. If I were to try it again I would pick something more standard and a lot smaller.
This is so relatable it's painful: many many hours of work, overly ambitious project, now feeling discouraged (but hopefully not willing to give up). It's some small consolation to me to know others have found themselves in this boat.
+1... like with a large enough engineering team, this is ultimately a guardrails problem, which in my experience with agentic coding it’s very solvable, at least in certain domains.
Like with large engineering teams I have little faith people will suddenly get the discipline to do the tedious, annoying, difficult work of building good enough guardrails now.
We don't even build guardrails that keep humans who test stuff as they go from introducing subtle bugs by accident; removing more eyes from that introduces new risks (although LLMs are also better at avoiding certain types of bugs, like copypasta shit).
"Test your tests" gets very difficult as a product evolves and increases in complexity. Few contracts (whether unit test level or "automation clicking on the element on the page") level are static enough to avoid needing to rework the tests, which means reworking the testing of the tests, ...
I think we'll find out just how low the general public's tolerance for bugs and regressions is.
But I am not so pessimistic. I do think it will be possible, because it is more fun to test your tests now than in the pre-LLM era. You just need a little bit of knowledge and patience, and the LLM absorbs most of the psychic pain.
If programmers get accustomed to doing their tests of tests, software might actually get better.
I've had the opposite results. I used to "vibe code" in languages that I knew, so that I could review the code and, I assumed, contribute myself. I got good enough results that I started using AI to build tools in languages I had no prior knowledge of. I don't even look at the code any more. I'm getting incredible results. I've been a developer for 30+ years and never thought this would be possible. I keep making more and more ambitious projects and AI just keeps banging them out exactly how I envision them in my mind.
To be fair I don't think someone with less experience could get these results. I'm leveraging every thing I know about writing software, computer science, product development, team management, marketing, written communication, requirements gathering, architecture... I feel like vibe coding is pushing myself and AI to the limits, but the results are incredible.
I don't want to dox myself since I'm doing it outside my regular job for the most part, but frameworks, apps (on those frameworks), low level systems stuff, linux-y things, some P2P, lots of ai tools. One thing I find it excels at is web front-end (which is my least favorite thing to actually code), easily as good as any front-end dev I've ever worked with.
I think my fatal error was trying to make something based on "novel science" (I'll be similarly vague). It was an extremely hard project to be fair to the AI.
It is my life goal to make that project though. I'm not totally depressed about it because I did validate parts of the project. But it was a let down.
Baby steps is key for me. I can build very ambitious things but I never ask it to do too much at once. Focus a lot on having it get the docs right before it writes any code (it'll use the docs) make the instructions reflexive (i.e. "update the docs when done"). Make libraries, composable parts... I don't want to be condescending since you may have tried all of that, but I feel like I'm treating it the same as when I architect things for large teams, thinking in layers and little pieces that can be assembled to achieve what I want.
I'll add that it does require some banging your head against the wall at times. I normally will only test the code after doing a bunch of this stuff. It often doesn't work as I want at that point and I'll spend a day "begging" it to fix all of the problems. I've always been able to get over those hurdles, and I have it think about why it failed and try to bake the reasoning into the docs/tests... to avoid that in the future.
I did make lots of design documents and sub-demos. I think I could have been cleverer about finding smaller pieces of the project which could be deliverables in themselves and which the later project could depend on as imported libraries.
This happened to me too in an experimental project where I was testing how far the model could go on its own. Despite making progress, I can't bare to look at the thing now. I don't even know what questions to ask the AI to get back into it, I'm so disconnected from it. Its exhausting to think about getting back into it; id rather just start from scratch.
The fascinating thing was how easy it was to lose control. I would set up the project with strict rules, md files and tell myself to stay fully engaged, but out of nowhere I slid into compulsive accept mode, or worse told the model to blatantly ignore my own rules I set out. I knew better, but yet it happened over and over. Ironically, it was as if my context window was so full of "successes" I forgot my own rules; I reward-hacked myself.
Maybe it just takes practice and better tooling and guardrails. And maybe this is the growing pains of a new programmers mindset. But left me a little shy to try full delegation any time soon, certainly not without a complete reset on how to approach it.
I’ll chime in to say that this happened to me as well.
My project would start good, but eventually end up in a state where nothing could be fixed and the agent would burn tokens going in circles to fix little bugs.
So I’d tell the agent to come up with a comprehensive refactoring plan that would allow the issues to be recast in more favorable terms.
I’d burn a ton of tokens to refactor, little bugs would get fixed, but it’d inevitably end up going in circles on something new.
That's kind of what learning to code is like, though. I assume you're using an llm because you don't know enough to do it entirely on your own. At least that's where I'm at and I've had similar experiences to you. I was trying to write a Rust program and I was able to get something in a working state, but wasn't confident it was secure.
I've found getting the llm to ingest high quality posts/books about the subject and use those to generate anki cards has helped a lot.
I've always struggled to learn from that sort of content on my own. That was leading me to miss some fundamental concepts.
I expect to restart my project several more times as I find out more of what I need to know to write good code.
Working with llms has made this so much easier. It surfaces ideas and concepts I had no idea about and makes it easy to convert them to an ingestible form for actual memorization. It makes cards with full syntax highlighting. It's delightful.
(I know you're replying to another guy but I just saw this.) I've been programming for 20 years, but I like the LLM as a learning assistant. The part I don't like is when you just come up with craftier and craftier ways to yell at it to do better, without actually understanding the code. The project I gave up on was at almost a million lines of code generated by the LLM, so it would have been impossible to easily restart it.
"Test the tests" is a big ask for many complex software projects.
Most human-driven coding + testing takes heavy advantage of being white-box testing.
For open-ended complex-systems development turning everything into black-box testing is hard. The LLMs, as noted in the post, are good at trying a lot of shit and inadvertently discovering stuff that passes incomplete tests without fully working. Or if you're in straight-up yolo mode, fucking up your test because it misunderstood the assignment, my personal favorite.
We already know it's very hard to have exhaustive coverage for unexpected input edge cases, for instance. The stuff of a million security bugs.
So as the combinatorial surface of "all possible actions that can be taken in the system in all possible orders" increases because you build more stuff into your system, so does the difficulty of relying on LLMs looping over prompts until tests go green.
"I reward-hacked myself" is a great way to put it!!
AI is too aware of human behavior, and it is teaching us that willpower and config files are not enough. When the agent keeps producing output that looks like progress, it is hard not to accept. We need something external that pushes back when we don't.
That is why automated tests matter: not just because they catch bugs (though they do), but because they are a commitment device. The agent can't merge until the tests pass. "Test the tests" matters because otherwise the agent just games whatever shallow metric we gave it, or when we're not looking, it guts the tests.
The discipline needs to be structural, not personal. You cannot out-willpower a system that is totally optimized to make you say yes.
> not only did the thing it made not end up being shippable
The difference between then and now is that often with the latest models, it is shippable without bugs within a couple of LLM reviews.
I’m ok doing the work of a dev manager while holding the position of developer.
I’m sure there was someone that once said “The washing machine didn’t do a good job, and I wasn’t proud of myself when I was using it”, but that didn’t stop washing machines from spreading to most homes in first-world countries.
It's also good for quickly creating legitimate looking scam and SEO spam sites. When they stop working, throw them away, and create a dozen more. Maintenance is not a concern. Scammers love this new tech.
When I ran this experiment it was pretty exhilarating for a while. Eventually it turned into QA testing the work of a bad engineer and became exhausting. Since I had sunken so much time into it I felt pretty bad afterwards that not only did the thing it made not end up being shippable, but I hadn't benefitted as a human being while working on it. I had no new skills to show. It was just a big waste of time.
So I think the "second way" is good for demos now. It's good for getting an idea of what something can look like. However, in the future I'll be extremely careful about not letting that go on for more than a day or two.