Just editing a text prompt is 5% of the task; the hard part is the evaluation. I would have tried a different approach:
- the UI should host a list of models
- a list of prompt variants
- and a collection of input-output pairs
The prompt can be enhanced with few-shot demonstrations. We can then evaluate with string matching or with GPT-4 as a judge, sweep over combinations of prompts, demos, and models to find the best configuration, and monitor for regressions.
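Something like this minimal sketch of the sweep (the model ids, prompt templates, eval pairs, and `call_model` are all placeholders, not a real API):

```python
import itertools

models = ["model-a", "model-b"]                       # placeholder model ids
prompt_variants = [
    "Summarize in one sentence: {input}",
    "TL;DR of the following text: {input}",
]
eval_pairs = [                                        # labeled input-output pairs
    {"input": "The cat sat on the mat.", "expected": "A cat sat on a mat."},
]

def call_model(model: str, prompt: str) -> str:
    # Stand-in for a real completion call; plug in your provider's client here.
    return ""

def exact_match(output: str, expected: str) -> bool:
    # Swap this for a GPT-4-as-judge scorer when exact matching is too strict.
    return output.strip().lower() == expected.strip().lower()

results = []
for model, template in itertools.product(models, prompt_variants):
    hits = sum(
        exact_match(call_model(model, template.format(input=p["input"])), p["expected"])
        for p in eval_pairs
    )
    results.append((hits / len(eval_pairs), model, template))

# Best (model, prompt) pair by accuracy; rerun on every change to catch regressions.
print(max(results, key=lambda r: r[0]))
```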
The prompt should be packed with a few labeled examples for demonstrations and eval; a text prompt alone won't be enough to know whether you've really honed it.
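A rough sketch of packing labeled examples into the prompt as demonstrations (the record fields and `build_prompt` are made up for illustration):

```python
# Labeled examples double as few-shot demonstrations and as the eval set.
labeled = [
    {"input": "2 + 2", "output": "4"},
    {"input": "3 * 5", "output": "15"},
]

def build_prompt(instruction: str, demos: list[dict], query: str) -> str:
    # Pack the demonstrations into the prompt, then append the new query.
    parts = [instruction]
    for d in demos:
        parts.append(f"Input: {d['input']}\nOutput: {d['output']}")
    parts.append(f"Input: {query}\nOutput:")
    return "\n\n".join(parts)

print(build_prompt("Answer the arithmetic question.", labeled, "7 - 4"))
```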