One thing to notice is that the gpt-3.5 bars are blue and the gpt-4 bars are green. This is because aider uses different prompting strategies for GPT-3.5 and 4.
GPT-3.5 is only able to reliably edit a file by returning a whole new copy of the file with the edits included. This is the "whole" edit format.
GPT-4 is able to use a more efficient "diff" edit format, where it specifies blocks of code to search and replace.
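For illustration, a diff-style edit might be expressed as a search/replace block along these lines (the file and function here are hypothetical, and this is only a sketch of the idea; the benchmarking writeup linked below describes the exact format aider uses):

```
greeting.py
<<<<<<< SEARCH
def greet():
    print("hello")
=======
def greet(name):
    print(f"hello, {name}")
>>>>>>> REPLACE
```

With the whole format, the model would instead return the complete updated contents of greeting.py, including every unchanged line.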
All of this is described and quantified in more detail in the original aider benchmarking writeup:

https://aider.chat/docs/benchmarks.html
The original article benchmarked both models using both edit formats (and some others). And indeed, gpt-4/whole beats gpt-3.5/whole. But it's very slow and very expensive to ask gpt-4 to return a whole copy of any file that it edits. So it's just much more practical to use gpt-4/diff, even though it performs a bit worse than gpt-4/whole.
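A back-of-envelope sketch shows why. Every number below is an assumption chosen for illustration (list prices, tokens per line, file and edit sizes), not benchmark data:

```python
# Rough comparison of completion-token costs for "whole" vs "diff" edits.
# All figures are assumptions for the sake of the sketch, not measured results.

COMPLETION_PRICE = 0.06 / 1000   # gpt-4 list price per completion token (assumed)
TOKENS_PER_LINE = 10             # crude average tokens per line of code (assumed)

file_lines = 300                 # hypothetical file being edited
changed_lines = 10               # hypothetical size of the edit

# "whole" format: the model re-emits the entire file.
whole_tokens = file_lines * TOKENS_PER_LINE

# "diff" format: roughly the changed lines twice (search text + replacement),
# plus a little framing overhead.
diff_tokens = changed_lines * TOKENS_PER_LINE * 2 + 20

print(f"whole: ~{whole_tokens} completion tokens, ${whole_tokens * COMPLETION_PRICE:.3f}")
print(f"diff:  ~{diff_tokens} completion tokens, ${diff_tokens * COMPLETION_PRICE:.3f}")
```

Under these assumptions the whole format emits more than ten times as many completion tokens for a small edit, and completion tokens are also what dominate response latency.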
Aider will let you do gpt-4/whole if you'd like to spend the time and money:
```
aider --model gpt-4 --edit-format whole
```
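With gpt-4, the diff format is already the default prompting strategy, so the flag above is only needed to opt into the slower, pricier whole format.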
Once OpenAI relaxes the rate limits, I will benchmark gpt-4-1106-preview/whole.