Just editing a text prompt is 5% of the task; the hard part is the evaluation. I would have tried a different approach:
- the UI should host a list of models
- a list of prompt variants
- and a collection of input-output pairs
The prompt can be enhanced with few-shot demonstrations. We can then evaluate with string matching or with GPT-4 as a judge, sweep over combinations of prompts, demos, and models to find the best configuration, and monitor for regressions.
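Something like this minimal sketch of the sweep (the model ids, prompt templates, eval pairs, and `call_model` are all placeholders, not a real API):

```python
import itertools

models = ["model-a", "model-b"]                       # placeholder model ids
prompt_variants = [
    "Summarize in one sentence: {input}",
    "TL;DR of the following text: {input}",
]
eval_pairs = [                                        # labeled input-output pairs
    {"input": "The cat sat on the mat.", "expected": "A cat sat on a mat."},
]

def call_model(model: str, prompt: str) -> str:
    # Stand-in for a real completion call; plug in your provider's client here.
    return ""

def exact_match(output: str, expected: str) -> bool:
    # Swap this for a GPT-4-as-judge scorer when exact matching is too strict.
    return output.strip().lower() == expected.strip().lower()

results = []
for model, template in itertools.product(models, prompt_variants):
    hits = sum(
        exact_match(call_model(model, template.format(input=p["input"])), p["expected"])
        for p in eval_pairs
    )
    results.append((hits / len(eval_pairs), model, template))

# Best (model, prompt) pair by accuracy; rerun on every change to catch regressions.
print(max(results, key=lambda r: r[0]))
```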
The prompt should be packed with a few labeled examples for demonstrations and eval; a text prompt alone won't be enough to know whether you've really honed it.
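A rough sketch of packing labeled examples into the prompt as demonstrations (the record fields and `build_prompt` are made up for illustration):

```python
# Labeled examples double as few-shot demonstrations and as the eval set.
labeled = [
    {"input": "2 + 2", "output": "4"},
    {"input": "3 * 5", "output": "15"},
]

def build_prompt(instruction: str, demos: list[dict], query: str) -> str:
    # Pack the demonstrations into the prompt, then append the new query.
    parts = [instruction]
    for d in demos:
        parts.append(f"Input: {d['input']}\nOutput: {d['output']}")
    parts.append(f"Input: {query}\nOutput:")
    return "\n\n".join(parts)

print(build_prompt("Answer the arithmetic question.", labeled, "7 - 4"))
```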