Hacker News
Generating Code Without Generating Technical Debt? (sourcery.ai)
34 points by fagnerbrack on Nov 10, 2023 | 20 comments



The major tips I would add to this are:

0. ask it to ask you clarifying questions, which it rarely does by default when you're prompting it to do a task; often it has a few questions, which you can efficiently update your original prompt to cover.

1. make it generate a test-suite; have it iteratively generate new tests which don't overlap with the old ones.

2. ask GPT-4 explicitly to identify improvements, edge-cases and bugfixes, and then go through the list one by one having it rewrite for each one; not infrequently, a fancy rewrite will fail the test-suite from #1, but given the failing test-case, GPT-4 can fix it.

3. once the code is clean (it has either worked through the whole list or you've rejected the remaining suggestions) and the test-suite is passing, ask it to write a summary design doc to put at the beginning.

With all this, you're set up for fairly maintainable code: with the test-suite and the up-front design doc, future LLMs can handle it natively and well, and humans should be able to read it easily after #2 has finished, so you don't need to care where it came from or try to track 'taint' through all future refactorings or usage. GPT-4 can write pretty readable, human-like code; it just doesn't necessarily do it the best way the first time (also like a human), so you have to apply inner-monologue ideas (a sketch of the loop follows).
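
For the curious, a minimal version of that loop driven from a script; it assumes the OpenAI Python SDK and pytest, and the prompts, file names and retry budget are purely illustrative:

    # Illustrative only: generate -> test -> fix until the suite passes.
    # Assumes `pip install openai pytest`; a real tool would also strip
    # markdown fences from the model's replies before writing files.
    import subprocess
    from openai import OpenAI

    client = OpenAI()

    def ask(prompt: str) -> str:
        resp = client.chat.completions.create(
            model="gpt-4",
            messages=[{"role": "user", "content": prompt}],
        )
        return resp.choices[0].message.content

    # 1. Implementation plus a test-suite (task and file names are made up).
    code = ask("Write slugify(title: str) -> str in Python. "
               "Ask clarifying questions first if anything is ambiguous.")
    tests = ask(f"Write pytest tests for this code, covering edge cases:\n{code}")
    open("slugify.py", "w").write(code)
    open("test_slugify.py", "w").write(tests)

    # 2. Run the suite and feed failures back, with a bounded retry budget.
    for _ in range(5):
        run = subprocess.run(["pytest", "-q", "test_slugify.py"],
                             capture_output=True, text=True)
        if run.returncode == 0:
            break
        code = ask(f"These tests fail:\n{run.stdout}\nFix this code:\n{code}")
        open("slugify.py", "w").write(code)

    # 3. Once green, ask for the design-doc summary to paste at the top.
    design_doc = ask(f"Write a short design-doc comment for this code:\n{code}")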


Other than "this is fantastic, thank you!", I have no actual response to your comment, but it would be remiss not to ask for more insights into coding with ai (maybe in a post in your blog?)



I’ve built conviction that code generation only gets useful in the long term when it is entirely deterministic, or filtered through humans. Otherwise it is almost always technical debt. Hence LLM code generation products are a cool toy, but no sensible teams will use them without an amazing “Day 2” workflow.

As an example, in my day job (https://speakeasyapi.dev), we sell code generation products using the OpenAPI specification to generate downstream artefacts (language SDKs, terraform providers, markdown documentation). The determinism makes it useful — API updates propagate continuously from server code, to specifications, then to the SDKs / providers / docs site. There are no breaking changes because the pipeline is deterministic and humans are in control of the API at the start. The code generation itself is just a means to an end : removing boilerplate effort and language differences by driving it from a source of truth (server api routes/types). Continuously generated, it is not debt.
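
Not our real pipeline, but a toy illustration of the deterministic half: walking a made-up OpenAPI fragment and emitting SDK method stubs, so a spec change propagates mechanically rather than by hand:

    # Toy deterministic generator: OpenAPI paths -> SDK method stubs.
    # The spec fragment and output shape are illustrative only.
    spec = {
        "paths": {
            "/users": {
                "get": {"operationId": "list_users", "summary": "List all users"},
                "post": {"operationId": "create_user", "summary": "Create a user"},
            }
        }
    }

    def generate_sdk(spec: dict) -> str:
        lines = ["class Client:"]
        for path, methods in spec["paths"].items():
            for verb, op in methods.items():
                lines.append(f"    def {op['operationId']}(self, **kwargs):")
                lines.append(f"        \"\"\"{op['summary']}\"\"\"")
                lines.append(f"        return self._request({verb.upper()!r}, {path!r}, **kwargs)")
        return "\n".join(lines)

    # Regenerated on every spec change; the output is never hand-edited.
    print(generate_sdk(spec))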

We’ve put a lot of effort into trying to make an LLM agent useful in this context. However, giving it control of generated code directly makes it hard to keep the “no breaking changes” and “consistency” restrictions that are needed to make code generation useful.

The trick we’ve landed on to get utility out of an LLM in a code generation task is to restrict it to manipulating a strictly typed interface document, so that it can only do non-breaking things to the code (e.g. adjust comments / descriptions / examples) by making changes through this interface.
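
A rough sketch of the idea (not our actual implementation): the model's output is validated into a narrow, typed patch whose fields can only affect documentation, so a breaking change is simply not representable:

    # Sketch: the LLM may only return this narrow, typed patch. Fields that
    # could break the API surface (names, types, required-ness) don't exist here.
    from dataclasses import dataclass

    @dataclass
    class DocPatch:
        operation_id: str   # which operation the patch targets
        description: str    # replacement human-readable description
        example: str        # replacement usage example

    def apply_patch(spec: dict, patch: DocPatch) -> None:
        for methods in spec["paths"].values():
            for op in methods.values():
                if op.get("operationId") == patch.operation_id:
                    op["description"] = patch.description  # non-breaking
                    op["example"] = patch.example           # non-breaking

    # The raw model response is parsed and validated into DocPatch before it
    # can touch the spec; a reply proposing e.g. a renamed field is rejected.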


+1, I work at a similar company (https://stainlessapi.com) and have had the exact same conclusions. We use LLMs in similar ways.

Well said @ThomasRooney.

There may be other contexts where pure LLM codegen could work well, but I haven't really encountered them personally yet.


We've reached a similar conclusion for refactoring.

The first version of our product (https://grit.io) was entirely LLM-powered. It was very easy to get started with, but reliability was low on enterprise-scale codebases.

Since then, we've switched to a similar approach: using LLMs to manipulate a verifiable interface, but making actual changes through deterministic code.


It's funny, because I read the headline and thought "normally, generated code carries very little technical debt, because if you do find a bug, you fix it once in the code generator and all instances are squashed", but the article is of course about AI, the topsy-turvy world where it's exactly the opposite.


Yes, until about a year ago (yes, I know Copilot is older, but not literally everyone was talking about this before GPT), code generation meant compilation (or transpilation, as some people call it when it's x->js) or some kind of template/model-driven generation where you feed in a formal spec (SDL etc.) and code gets generated. Now every post about code generation seems to be about AI. It would be good if the latter crowd read some material from the former, as that field is quite old and, imho, interesting. More interesting than coercing GPT into doing something better; it won't be consistent anyway.


I think the trick is to keep regenerating the code from your configuration on a regular basis, if not on every build. That's what we used to do a decade ago, generating code for our front-end based on entities in Java, with enough test coverage to check correctness.


This got me thinking. What about two new docblock tags? @handwritten @generated

The first, if used, implies the code can be contextualized through another employee at the office (the one who wrote it).

The second, if used, implies you may not be able to get an answer as to why the code was written as it was, since an LLM generated it.
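
Purely as an illustration (the functions are made up), in Python docstrings that might look like:

    def normalize_url(url: str) -> str:
        """Force the https scheme on a URL.

        @generated -- written by an LLM; the original rationale may not be
        recoverable from anyone at the office.
        """
        ...

    def billing_cutoff(day: int) -> int:
        """Clamp the billing day to month boundaries.

        @handwritten -- ask the author (see git blame) why it works this way.
        """
        ...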


Makes more sense to just not commit generated code without understanding it first. As has been said many times, writing the code was never the hard part. Software devs are still responsible for the integrity of implementation details.


Yeah, I like the sentiment here, but I wouldn't pass review on a piece of code that had the "@generated" tag on it to absolve either the author or me as the reviewer of actually understanding what it does.

At the end of the day, whether you copy-pasted the code from StackOverflow, let Copilot generate it or M-X-Butterfly'd it in yourself doesn't make a difference as you still need to understand thoroughly what it does.


Half joking, but I try to make sure to ask the AI to "explain lines 5-20 as if I am new to coding" and throw that in a comment.


Yup. A bad comment can be worse for maintenance than bad code without any comments.


People usually commit code that is written by code assistants intermixed with their own changes. You would not be able to track it, even on first commit.


We can use a fine-grained commit style to distinguish hand-written and machine-written lines of code. I like this style for command-line driven AI tooling, where my tool is making bigger code changes (completing a function or file). In this case, it's easy to include a "git commit..." command in the tool.

Unfortunately, it doesn't work with GitHub Copilot's style, which assumes really fine-grained interaction.
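
For the command-line-driven case, a minimal sketch, assuming the tool writes the files itself and then commits them with a trailer marking machine authorship (the trailer is just a convention the tool picks, nothing git enforces):

    # Sketch: commit machine-written changes separately, tagged with a trailer.
    import subprocess

    def commit_generated(paths: list[str], message: str) -> None:
        subprocess.run(["git", "add", *paths], check=True)
        subprocess.run(
            ["git", "commit", "-m", f"{message}\n\nGenerated-by: my-llm-tool"],
            check=True,
        )

    # Hand edits go in ordinary commits, so later
    #   git log --grep "Generated-by:"
    # separates the two histories without any per-line bookkeeping.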


Honestly, if you want to generate less technical debt, then writing an `https` wrapper function on top of a URL parser is probably the wrong thing to do.

I mean `urlparse(url)._replace(scheme='https')` is so literal and straightforward that it really ought not be its own separate function. Adding a wrapper that does nothing but add more code and re-throw the same error is just generating technical debt by writing a function that ought not exist.
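
Spelled out with just the standard library (ParseResult is a namedtuple, hence `_replace` rather than assignment), that's roughly:

    # Roughly what the article's wrapper boils down to, stdlib only.
    from urllib.parse import urlparse, urlunparse

    url = "http://example.com/path?q=1"
    https_url = urlunparse(urlparse(url)._replace(scheme="https"))
    print(https_url)  # https://example.com/path?q=1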

Arguably all code is technical debt, but so far the world isn't kind enough to bend to our will without any code.


I don't understand this fixation with procedural code generation. It doesn't work. Stop spamming us, please.


The generated code is still bad.

* Why would it print a warning to stdout for invalid input?

* The function expects a URL. So if it gets something else, it should not catch that exception on its own. The exception is right.


Subliminal advertising for Bunq in the header image.



