0. ask it to ask you clarifying questions, which it rarely does by default when you're prompting it to do a task; often it has a few questions, which you can efficiently update your original prompt to cover.
1. make it generate a test-suite; have it iteratively generate new tests which don't overlap with the old ones.
2. ask GPT-4 explicitly to identify improvements, edge-cases and bugfixes, and then go through the list one by one having it rewrite for each one; not infrequently, a fancy rewrite will fail the test-suite from #1, but given the failing test-case, GPT-4 can fix it.
3. once the code is clean (it has either worked through the whole list or you've rejected the remaining suggestions) and the test-suite is passing, ask it to write a summary design doc to put at the top of the code.
With all this, you're set up for fairly maintainable code: with the test-suite and the up-front design doc, future LLMs can handle it natively and well, and humans should be able to read it easily once #2 has finished. You don't need to care where it came from or try to track 'taint' through all future refactorings or usage - GPT-4 can write pretty readable, human-like code, it just doesn't necessarily do it the best way the first time (also like a human), so you have to apply inner-monologue ideas.
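A rough sketch of that loop as a script, assuming the `openai` (>=1.0) Python client and `pytest`; the model name, prompt wording and file paths are placeholders, not a prescription:

```python
import subprocess
from pathlib import Path
from openai import OpenAI

client = OpenAI()
history = []

def ask(prompt: str) -> str:
    """One conversational turn; the transcript is kept so later steps see earlier ones."""
    history.append({"role": "user", "content": prompt})
    reply = client.chat.completions.create(model="gpt-4", messages=history)
    answer = reply.choices[0].message.content
    history.append({"role": "assistant", "content": answer})
    return answer  # in practice you'd also strip markdown code fences here

# 0. clarifying questions first; answer them, then fold the answers into the task prompt
print(ask("Before writing any code, ask me clarifying questions about this task: <task>"))

# 1. implementation plus a test-suite, grown iteratively
Path("impl.py").write_text(ask("Now write the implementation."))
Path("test_impl.py").write_text(ask("Write a pytest test-suite for it."))
Path("test_impl.py").write_text(ask("Extend it with new tests that don't overlap the old ones."))

# 2. work through its own list of improvements, with the tests as a safety net
for item in ask("List improvements, edge-cases and bugfixes, one per line.").splitlines():
    Path("impl.py").write_text(ask(f"Rewrite the code to address: {item}"))
    run = subprocess.run(["pytest", "-q"], capture_output=True, text=True)
    if run.returncode != 0:  # fancy rewrites sometimes break the suite; show it the failure
        Path("impl.py").write_text(ask(f"That rewrite fails the tests:\n{run.stdout}\nFix it."))

# 3. finish with a design doc to sit at the top of the file
print(ask("Write a summary design doc to put at the top of impl.py."))
```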
Other than "this is fantastic, thank you!", I have no actual response to your comment, but it would be remiss not to ask for more insights into coding with ai (maybe in a post in your blog?)
I’ve built conviction that code generation only gets useful in the long term when it is entirely deterministic, or filtered through humans. Otherwise it is almost always technical debt. Hence LLM code generation products are a cool toy, but no sensible teams will use them without an amazing “Day 2” workflow.
As an example, in my day job (https://speakeasyapi.dev), we sell code generation products using the OpenAPI specification to generate downstream artefacts (language SDKs, Terraform providers, Markdown documentation). The determinism makes it useful: API updates propagate continuously from server code, to specifications, then to the SDKs / providers / docs site. There are no breaking changes because the pipeline is deterministic and humans are in control of the API at the start. The code generation itself is just a means to an end: removing boilerplate effort and language differences by driving it from a source of truth (server API routes/types). Continuously generated, it is not debt.
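A toy illustration of that property (this is not Speakeasy's actual pipeline, and `SPEC` is a stand-in for a real OpenAPI document): the SDK text is a pure function of the spec, so regenerating is idempotent and every diff in the output traces back to a deliberate change in the source of truth.

```python
# Stand-in for a real OpenAPI document; in practice this is derived from the server code.
SPEC = {
    "getUser":    {"method": "GET",  "path": "/users/{id}"},
    "createUser": {"method": "POST", "path": "/users"},
}

def render_sdk(spec: dict) -> str:
    """Emit a tiny SDK as a pure, deterministic function of the spec."""
    lines = [
        "import requests",
        "",
        "class Client:",
        "    def __init__(self, base_url):",
        "        self.base_url = base_url",
        "",
    ]
    for op, info in sorted(spec.items()):  # sorted => byte-identical output on every run
        lines.append(f"    def {op}(self, **params):")
        lines.append(f"        return requests.request({info['method']!r}, "
                     f"self.base_url + {info['path']!r}.format(**params))")
        lines.append("")
    return "\n".join(lines)

print(render_sdk(SPEC))  # same spec in, same SDK out, on every build
```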
We’ve put a lot of effort into trying to make an LLM agent useful in this context. However, giving it control of generated code directly makes it hard to keep the “no breaking changes” and “consistency” restrictions that are needed to make code generation useful.
The trick we’ve landed on to get utility out of an LLM in a code generation task is to restrict it to manipulating a strictly typed interface document, so that it can only do non-breaking things to the code (e.g. adjust comments / descriptions / examples), with every change made through this interface.
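Hypothetically, such an interface document can be as small as a frozen record per operation that carries only the prose fields; the names below are made up, but the shape is the point: the LLM never sees the parts of the spec that could break a consumer.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class OperationDocs:
    operation_id: str   # key into the real spec; used for lookup, never for renaming
    description: str    # prose the LLM is allowed to rewrite
    example: str        # usage example the LLM is allowed to rewrite

def apply_docs(spec: dict, docs: list[OperationDocs]) -> dict:
    """Deterministically merge LLM-edited prose back into the source-of-truth spec."""
    for d in docs:
        if d.operation_id not in spec:
            raise ValueError(f"unknown operation {d.operation_id!r}")  # reject hallucinated ops
        spec[d.operation_id]["description"] = d.description
        spec[d.operation_id]["example"] = d.example
    return spec  # paths, methods and types were never exposed, so they cannot break
```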
We've reached a similar conclusion for refactoring.
The first version of our product (https://grit.io) was entirely LLM-powered. It was very easy to get started with, but reliability was low on enterprise-scale codebases.
Since then, we've switched to a similar approach: using LLMs to manipulate a verifiable interface, but making actual changes through deterministic code.
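For refactoring, one illustrative version of the same idea (not grit.io's implementation): the LLM produces a small, machine-checkable plan, the plan is verified against the real code, and a deterministic transform does the editing.

```python
import ast, json

def apply_refactor_plan(source: str, llm_output: str) -> str:
    """The LLM only fills a constrained plan, e.g. {"rename": {"old_name": "new_name"}}."""
    plan = json.loads(llm_output)
    defined = {n.name for n in ast.walk(ast.parse(source))
               if isinstance(n, (ast.FunctionDef, ast.ClassDef))}
    for old, new in plan["rename"].items():
        if old not in defined:                      # verify against the real code, not the LLM
            raise ValueError(f"{old!r} is not defined in this file")
        if not new.isidentifier():
            raise ValueError(f"{new!r} is not a valid identifier")
    for old, new in plan["rename"].items():
        source = source.replace(old, new)           # a real tool would rewrite the AST instead
    return source
```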
It's funny because I read the headline and thought "normally, generated code is very low technical debt, because if you do find a bug, it's easy to fix it once in the code generator and all instances will be squashed" - but the article is of course about AI, the topsy-turvy world where it's exactly the opposite.
Yes, until about a year ago (yes, I know Copilot is older, but not literally everyone was talking about this before GPT), code generation meant compilation (or transpilation, as some say when it's x->js) or some type of template/model-driven generation where you input a formal spec (SDL etc.) and code gets generated. Now every post about code generation seems to be about AI. It would be good if the latter crowd read some material from the former, as this field is quite old and, IMHO, interesting. More interesting than coercing GPT into doing something better, which won't be consistent anyway.
I think the trick is to keep regenerating the code from your configuration on a regular basis, if not on every build. That's what we used to do a decade ago, generating code for our front-end based on entities in Java, with enough test coverage to check correctness.
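A small sketch of how to keep that honest in CI (the file path and the `render` generator are hypothetical): regenerate on every build and fail if the committed output has drifted from what the generator produces.

```python
import pathlib, sys

GENERATED = pathlib.Path("generated/client.py")   # hypothetical location of the generated code

def check_drift(render) -> None:
    """`render` is the deterministic generator: configuration in, code out."""
    fresh = render()
    if GENERATED.read_text() != fresh:
        GENERATED.write_text(fresh)
        sys.exit("generated code was stale; regenerated, please review and commit")

# e.g. in CI: check_drift(render_sdk), then run the test-suite against the fresh output
```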
Makes more sense to just not commit generated code without understanding it first. As has been said many times, writing the code was never the hard part. Software devs are still responsible for the integrity of implementation details.
Yeah, I like the sentiment here, but I wouldn't pass review on a piece of code that had the "@generated" tag on it to absolve either the author or me as the reviewer of actually understanding what it does.
At the end of the day, whether you copy-pasted the code from StackOverflow, let Copilot generate it or M-X-Butterfly'd it in yourself doesn't make a difference as you still need to understand thoroughly what it does.
People usually commit code that is written by code assistants intermixed with their own changes. You would not be able to track it, even in the first commit.
We can use a fine-grained commit style to distinguish hand-written and machine-written lines of code. I like this style for command-line driven AI tooling, where my tool is making bigger code changes (completing a function or file). In this case, it's easy to include a "git commit..." command in the tool.
Unfortunately, it doesn't work with GitHub Copilot's style, which assumes really fine-grained interaction.
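For the command-line case, the commit step can be as small as this (the `Generated-by` trailer name and paths are made up; git trailers themselves are standard): machine-written changes go into their own commits, so `git log` can separate them from hand-written ones later.

```python
import subprocess

def commit_generated(paths: list[str], summary: str, model: str = "gpt-4") -> None:
    """Commit only the tool's changes, marked with a trailer in the commit message."""
    subprocess.run(["git", "add", *paths], check=True)
    subprocess.run(
        ["git", "commit", "-m", summary, "-m", f"Generated-by: {model}"],  # second -m = trailer paragraph
        check=True,
    )
```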
Honestly, if you want to generate less technical debt, then writing an https-forcing function on top of a URL parser is probably the wrong thing to do.
I mean `urlparse(url)._replace(scheme='https')` is so literal and straightforward that it really ought not be its own separate function. Adding a wrapper that does nothing but add more code and rethrow the same error is just generating technical debt by writing a function that ought not exist.
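For reference, the whole behaviour inlined with the standard library (nothing here is invented):

```python
from urllib.parse import urlparse

url = urlparse("http://example.com/path")._replace(scheme="https").geturl()
# 'https://example.com/path' -- short enough that a wrapper adds little beyond indirection
```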
Arguably all code is technical debt, but so far the world isn't kind enough to bend to our will without any code.